datanymizer / datanymizer

Powerful database anonymizer with flexible rules. Written in Rust.

Home Page: https://datanymizer.github.io/docs/

License: MIT License

Languages: Rust 98.79%, Shell 1.13%, Dockerfile 0.08%
Topics: database-dump, fake-data, fake-generator, dump-data, dumper, database-anonymizer, postgresql-database, database, hacktoberfest

datanymizer's Issues

Execution order for tables

Is there a way to define an execution order for the tables? The values are not written to the store before the read operation starts. Swapping the table order in the config works, but it is inconsistent and fails after some runs (a reordered sketch follows the config below).

tables:
  - name: User
  ...
    template:
      format: "{{ store_read(...) }}"
  ...
  - name: Profile
  ...
    template:
      format: "{{ store_write(...) }}"
  ...
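
A minimal sketch of the workaround mentioned above, assuming tables are processed in the order they appear in the config (the reporter notes that even this is not reliable); some_field is a placeholder name:

tables:
  # the table that writes to the store is listed first...
  - name: Profile
    rules:
      some_field:
        template:
          format: "{{ store_write(...) }}"
  # ...so that the reading table can find the value
  - name: User
    rules:
      some_field:
        template:
          format: "{{ store_read(...) }}"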

pg_datanymizer does not work in a container

I ran the demo as described in demo/README.md.
I built a container with pg_datanymizer in it and ran it via make dump.
But I got an error:

Error: invalid type: unit value, expected struct FirstNameTransformer

If I comment out the first_name block in dvdrental.yml, like this:

filter:
  only:
    - public.actor
tables:
  - name: actor
    rules:
#      first_name:
#        first_name: ~
      last_name:
        last_name: ~
      last_update:
        datetime:
          from: 1990-01-01T00:00:00+00:00
          to: 2010-12-31T00:00:00+00:00

Then I get the same error for last_name:

Error: invalid type: unit value, expected struct LastNameTransformer

After also commenting out last_name, like this:

filter:
  only:
    - public.actor
tables:
  - name: actor
    rules:
#      first_name:
#        first_name: ~
#      last_name:
#        last_name: ~
      last_update:
        datetime:
          from: 1990-01-01T00:00:00+00:00
          to: 2010-12-31T00:00:00+00:00

Everything works fine.

When I build a container with PostgreSQL and download the pg_datanymizer CLI into it, everything works fine!

My Dockerfile, for example: Link

The built container on Docker Hub: agmayer/pg_datanymizer
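
A guess, not a confirmed fix: the error mentions a unit value where a struct is expected, and configs in other issues on this page pass an empty mapping (e.g. first_name: {}) rather than ~, so a variant worth trying might be:

filter:
  only:
    - public.actor
tables:
  - name: actor
    rules:
      first_name:
        first_name: {}   # empty mapping instead of ~ (untested guess)
      last_name:
        last_name: {}    # same here
      last_update:
        datetime:
          from: 1990-01-01T00:00:00+00:00
          to: 2010-12-31T00:00:00+00:00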

In-place data modification

Hi! Is it possible to change data in place (without a dump-restore cycle)?
We already have an automated backups-to-staging restoration cycle (Postgres, base backups) and would like to use this tool for in-place data masking.

Issue with reading from key value store

Hi @evgeniy-r, I am facing some issues while using the key-value store. Please correct me if I am using it wrong.
User is connected to Profile via a foreign key in the User table. So I use profile_id as the key when writing to the store, and then just the id from the Profile table when reading.

Config

tables:
  - name: User
    rules:
      phone:
        template:
          format: "{{ _1 }}{{ store_write(key='user_phonenumbers.' ~ prev.profile_id, value=_1) }}"
          rules:
            - phone:
                format: "+############"
                uniq: true
  - name: Profile
    rules:
      phone:
        # reading phone numbers from `user_phonenumbers` stored in `User`
        template:
          format: "{{ store_read(key='user_phonenumbers.' ~ prev.id) }}"

Output

...
Prepare to dump table: default$default.Profile
Error: Failed transform Failed to render 'TemplateTransformerTemplate'
ERROR: Job failed: exit code 1
FATAL: exit code 1         

To confirm whether the template errors on reading, I removed the template under Profile and replaced it with phone to randomize the value, and it works that way. So I am fairly sure something is wrong, either with how I am implementing it or with the anonymizer.

Failed after schema change.

DB:

create table "user"
(
	id serial,
	first_name text,
	last_name text,
	email_field text
);

insert into "user" (first_name, last_name, email_field) VALUES ('A', 'B', '[email protected]');
alter table "user" drop column last_name;

Config:

tables:
  - name: "user"
    rules:
      email_field:
        email:
          kind: Safe

Error:

thread 'main' panicked at 'index out of bounds: the len is 3 but the index is 3', datanymizer_engine/src/engine.rs:40:25

Malformed query repeats one field multiple times

We have a table with the following 30 fields:
[screenshot of the table's 30 columns]

This fails and datanymizer exits because an internal untransformed query for the dump is repeating the event_type field 3 times in the COPY:

None ret from qw.copy_out(untransformed_query=

"COPY "metrics"."deployment"("deployment_id", "hypervisor_key", "specification_id", "vmware_deployment_target_id",
     "vcenter_url", "vcenter_cluster", "vcenter_folder", "clone_source_id", "event_type", "event_type", "event_type",
     "org_name", "org_id", "mto_node_id", "mto_node_path", "mto_node_org_path", "academy_event_id", "package_id",
     "package_estimated_duration_in_seconds", "course_plan_id", "deployer_user_id", "user_full_name", "user_email",
     "warden_event_id", "created_at", "last_updated", "project_id", "project_org_id", "project_org_name",
     "project_mto_node_id", "project_mto_node_path", "project_mto_node_org_path") TO STDOUT
"),

table=metrics.deployment,
count=0,
error=Error {
    kind: Db,
    cause: Some(DbError {
        severity: "ERROR",
        parsed_severity: Some(Error),
        code: SqlState(E42701),
        message: "column \"event_type\" specified more than once",
        detail: None, hint: None, position: None, where_: None, schema: None, table: None, column: None,
        datatype: None, constraint: None, file: Some("copy.c"), line: Some(5155),
        routine: Some("CopyGetAttnums")
    })
}

Working now to understand why and fix it.

Add template references to other transformed values of the same row

Possible syntax (look at tr):

tables:
  - name: users
    rules:
      login:
        username: {}
      preferences:
        template:
          format: '{"some_complex_json": {"field": "{{tr.login}}"}}'

Tera also allows tr["login"].

The name tr means "transformed row". We could use tr_row or transformed_row instead (though these might be too long).

We could also use just row, but what if in the future we decide to expose the values of the original row too?

Make bootstrap in demo dir isn't working?

Discussed in #107

Originally posted by marcelofernandez October 5, 2021
Seems like the demo docker environment isn't working:

marcelo@marcelo:~/dev/datanymizer/demo/$ make bootstrap
[...]
   Compiling nom v5.1.2
   Compiling openssl-sys v0.9.63
error: failed to run custom build command for `openssl-sys v0.9.63`

Caused by:
  process didn't exit successfully: `/usr/src/target/release/build/openssl-sys-9793598f3bd829c9/build-script-main` (exit code: 101)
  --- stdout
  cargo:rustc-cfg=const_fn
  cargo:rerun-if-env-changed=X86_64_UNKNOWN_LINUX_MUSL_OPENSSL_LIB_DIR
  X86_64_UNKNOWN_LINUX_MUSL_OPENSSL_LIB_DIR unset
  cargo:rerun-if-env-changed=OPENSSL_LIB_DIR
  OPENSSL_LIB_DIR unset
  cargo:rerun-if-env-changed=X86_64_UNKNOWN_LINUX_MUSL_OPENSSL_INCLUDE_DIR
  X86_64_UNKNOWN_LINUX_MUSL_OPENSSL_INCLUDE_DIR unset
  cargo:rerun-if-env-changed=OPENSSL_INCLUDE_DIR
  OPENSSL_INCLUDE_DIR unset
  cargo:rerun-if-env-changed=X86_64_UNKNOWN_LINUX_MUSL_OPENSSL_DIR
  X86_64_UNKNOWN_LINUX_MUSL_OPENSSL_DIR unset
  cargo:rerun-if-env-changed=OPENSSL_DIR
  OPENSSL_DIR unset
  cargo:rerun-if-env-changed=OPENSSL_NO_PKG_CONFIG
  cargo:rerun-if-env-changed=PKG_CONFIG_ALLOW_CROSS_x86_64-unknown-linux-musl
  cargo:rerun-if-env-changed=PKG_CONFIG_ALLOW_CROSS_x86_64_unknown_linux_musl
  cargo:rerun-if-env-changed=TARGET_PKG_CONFIG_ALLOW_CROSS
  cargo:rerun-if-env-changed=PKG_CONFIG_ALLOW_CROSS
  cargo:rerun-if-env-changed=PKG_CONFIG_x86_64-unknown-linux-musl
  cargo:rerun-if-env-changed=PKG_CONFIG_x86_64_unknown_linux_musl
  cargo:rerun-if-env-changed=TARGET_PKG_CONFIG
  cargo:rerun-if-env-changed=PKG_CONFIG
  cargo:rerun-if-env-changed=PKG_CONFIG_SYSROOT_DIR_x86_64-unknown-linux-musl
  cargo:rerun-if-env-changed=PKG_CONFIG_SYSROOT_DIR_x86_64_unknown_linux_musl
  cargo:rerun-if-env-changed=TARGET_PKG_CONFIG_SYSROOT_DIR
  cargo:rerun-if-env-changed=PKG_CONFIG_SYSROOT_DIR
  run pkg_config fail: "pkg-config has not been configured to support cross-compilation.\n\n                Install a sysroot for the target platform and configure it via\n                PKG_CONFIG_SYSROOT_DIR and PKG_CONFIG_PATH, or install a\n                cross-compiling wrapper for pkg-config and set it via\n                PKG_CONFIG environment variable."

  --- stderr
  thread 'main' panicked at '

  Could not find directory of OpenSSL installation, and this `-sys` crate cannot
  proceed without this knowledge. If OpenSSL is installed and this crate had
  trouble finding it,  you can set the `OPENSSL_DIR` environment variable for the
  compilation process.

  Make sure you also have the development packages of openssl installed.
  For example, `libssl-dev` on Ubuntu or `openssl-devel` on Fedora.

  If you're in a situation where you think the directory *should* be found
  automatically, please open a bug at https://github.com/sfackler/rust-openssl
  and include information about your system as well as this message.

  $HOST = x86_64-unknown-linux-gnu
  $TARGET = x86_64-unknown-linux-musl
  openssl-sys = 0.9.63

  ', /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/openssl-sys-0.9.63/build/find_normal.rs:174:5
  note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
warning: build failed, waiting for other jobs to finish...
error: build failed

I'm trying this out in an Ubuntu 20.04 + Docker 20.10.9~3-0~ubuntu-focal setup, on an x86_64 AMD machine...

PostgreSQL: the same user-defined type present in multiple schemas breaks the process

Folks,

I have a situation where the same user-defined type name is reused across two different schemas in the same database, and it breaks the process.

I think the query that fetches the columns of a table needs a small tweak. Line 32 of schema_inspector.rs needs to change from:

const TABLE_COLUMNS_QUERY: &str = "SELECT cc.column_name, cc.ordinal_position, cc.data_type, pt.oid
                                   FROM information_schema.columns as cc
                                   JOIN pg_catalog.pg_type as pt
                                   ON cc.udt_name = pt.typname
                                   WHERE cc.table_schema = $1 and cc.table_name = $2
                                   ORDER BY cc.ordinal_position ASC";

to

const TABLE_COLUMNS_QUERY: &str = "SELECT cc.column_name, cc.ordinal_position, cc.data_type, pt.oid
                                   FROM information_schema.columns as cc
                                   JOIN pg_catalog.pg_type as pt
                                   ON cc.udt_name = pt.typname
                                   LEFT JOIN pg_catalog.pg_namespace as pn
                                   ON (
                                        cc.data_type = 'USER-DEFINED'
                                        AND cc.table_schema = pn.nspname
                                        and pn.oid = pt.typnamespace
                                      )
                                   WHERE cc.table_schema = $1 and cc.table_name = $2
                                   AND ( cc.data_type <> 'USER-DEFINED' OR pn.nspname is not null )
                                   ORDER BY cc.ordinal_position ASC";

Mainly, it ensures that the schema name is taken into account when resolving a column's type when the type is user-defined.

Malformed query when table has no defined fields

Our database uses timescaledb, and there's a schema called _timescaledb_cache with three tables: cache_inval_bgw_job, cache_inval_extension, and cache_inval_hypertable. For some reason there are no fields defined in these tables.

[screenshot of the three empty _timescaledb_cache tables]

When running pg_datanymizer against this, it fails with an error that made it hard to know which table was the cause (we have many schemas and tables). Running Rust nightly with RUST_BACKTRACE=full let me see the offending line, and I modified pg_datanymizer a bit to print out the offending query.

    (prior output deleted)

[Dumping: _timescaledb_internal._hyper_9_327_chunk] [|##################################################|] 0 of 0 rows [100%] (0s)
[Dumping: _timescaledb_internal._hyper_9_327_chunk] Finished in 0 seconds
[Dumping: _timescaledb_internal._hyper_15_976_chunk] [|##################################################|] 68288 of 68288 rows [100%] (0s)
[Dumping: _timescaledb_internal._hyper_15_976_chunk] Finished in 2 seconds
Error: db error: ERROR: syntax error at or near ")"

Caused by:
ERROR: syntax error at or near ")"

Stack backtrace:
0: std::backtrace_rs::backtrace::libunwind::trace
at /rustc/2681f253bcdb31a274411d2be456e7c6a1c67d62/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
1: std::backtrace_rs::backtrace::trace_unsynchronized
at /rustc/2681f253bcdb31a274411d2be456e7c6a1c67d62/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
2: std::backtrace::Backtrace::create
at /rustc/2681f253bcdb31a274411d2be456e7c6a1c67d62/library/std/src/backtrace.rs:328:13
3: std::backtrace::Backtrace::capture
at /rustc/2681f253bcdb31a274411d2be456e7c6a1c67d62/library/std/src/backtrace.rs:296:9
4: anyhow::error::<impl core::convert::From<E> for anyhow::Error>::from
at /Users/beynon/.cargo/registry/src/github.com-1ecc6299db9ec823/anyhow-1.0.33/src/error.rs:464:25
5: <core::result::Result<T,F> as core::ops::try_trait::FromResidual<core::result::Result<core::convert::Infallible,E>>>::from_residual
at /rustc/2681f253bcdb31a274411d2be456e7c6a1c67d62/library/core/src/result.rs:2052:27
6: datanymizer_dumper::postgres::dumper::PgDumper<W,I>::dump_table
at /Users/beynon/work/datanymizer/datanymizer_dumper/src/postgres/dumper.rs:109:26
7: <datanymizer_dumper::postgres::dumper::PgDumper<W,I> as datanymizer_dumper::Dumper>::data
at /Users/beynon/work/datanymizer/datanymizer_dumper/src/postgres/dumper.rs:173:17
8: datanymizer_dumper::Dumper::dump
at /Users/beynon/work/datanymizer/datanymizer_dumper/src/lib.rs:21:9
9: pg_datanymizer::app::App::run
at /Users/beynon/work/datanymizer/cli/pg_datanymizer/src/app.rs:34:31
10: pg_datanymizer::main
at /Users/beynon/work/datanymizer/cli/pg_datanymizer/src/main.rs:13:5

One of the offending queries is:

COPY "_timescaledb_cache"."cache_inval_hypertable"() TO STDOUT

The solution seems to be that if the column list is empty, the parentheses must be omitted:

COPY "_timescaledb_cache"."cache_inval_hypertable" TO STDOUT

I can easily skip these tables with an "except:" in the config.yaml, but then the pg_dump output is missing a table that may still be required by timescaledb to function properly.

Create example for pg_datanymizer

Create an example to introduce how to get started with the pg_datanymizer CLI tool.
Set up an environment with docker-compose, a test database dump, and an example config file.

Postgres Dump missing Create statements

When I run pg_datanymizer against a GCP managed SQL instance, I only get a data dump; it does not add any CREATE statements for the relations.

Running pg_dump directly against the same database with the same credentials gives a full dump as expected.

The command I am running is
pg_datanymizer db --config /app/config.yml --host cloud-sql-proxy --username <redacted> --password <redacted> --file /app/dump.sql --port 5432

The config just has table rules, no filters or anything like that.

Would appreciate any guidance on this.

Feature Request: wildcards in filters

Currently, I have hundreds of tables in my database, all under different schemas.
I want to anonymize around 5-10 of my tables. I am able to achieve this using the following config:

.....
filter:
  data:
    only:
      - "default$default.Profile"
      ...

Potentially, this creates another issue: it also copies over the other schemas, which I don't want to copy.

When I try to use the following config:

.....
filter:
  schema:
    only:
      - "default$default.Profile"
      ...

Only the selected tables get copied, but not their dependencies, e.g. ENUMs used in those tables.

So I would like to request a feature where the filter accepts a wildcard * for anything under a schema,
something like:

filter:
  schema:
    only:
      - "default$default.*"
    except:
      - "public.*"

Thank you in advance.

Add the limit when dumping data

Possible config syntax:

tables:
  - users:
    # per table
    limit: 10000
    # ........................
default:
  # global
  limit: 1000

We could also consider more flexible customization: specifying an SQL query for the table (a purely hypothetical sketch follows).
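
A purely hypothetical sketch of what that could look like (the query key does not exist today; the table name and the query itself are only illustrative):

tables:
  - users:
    # hypothetical: dump only rows matching a custom query instead of a plain numeric limit
    query: "SELECT * FROM users WHERE created_at > now() - interval '1 year'"
default:
  # global
  limit: 1000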

Error: no matching tables were found

@evgeniy-r I am facing an issue while using the CLI with my DB. I am not sure whether I am using the config correctly; can you help me rectify the issue?

Thank you in advance.

Background

I have a database with multiple schemas and I want to copy over the schema default$default and the tables under that schema. Of course, I also anonymize certain fields in the tables.

config.yml

tables:
  - name: default$default.Profile
    rules:
      firstName:
        first_name: {}
        ...
        ...
filter:
  schema:
    only:
      - default$default

Output

$ ./bin/pg_datanymizer -f backup.sql -c anonymizer/config.yml --accept_invalid_certs $POSTGRES_CONNECTION_STRING
Prepare data scheme...
pg_dump error. Command:
pg_dump --section pre-data -t default$default <POSTGRES_CONNECTION_STRING>
Output:
pg_dump: error: no matching tables were found

Release v0.1.0

What needs to be done for the v0.1.0 release.

Infrastructure:

  • Set up CI (check, fmt, build, clippy, test, audit) (#7);
  • Automate release process (linux, darwin, docker) (#13);
  • Create Homebrew formula;

Improvements:

  • Make uniqueness be available for other transformers (#8);
  • Add tables exclusion and inclusion at the schema creation level (#19);

Add tables exclusion and inclusion at the schema creation level

Currently, when a filter is added to the list of tables, it only filters the data copying process. The schema dump needs to respect the filter as well. To do this, we need to add arguments to the pg_dump invocation.

pg_dump is invoked in two places:
https://github.com/datanymizer/datanymizer/blob/main/datanymizer_dumper/src/postgres/dumper.rs#L62
and
https://github.com/datanymizer/datanymizer/blob/main/datanymizer_dumper/src/postgres/dumper.rs#L134

Inclusion and exclusion support needs to be added to both calls:

  • Including
  • Excluding

Panic on TemplateTransformerTemplate

I am getting a panic when I try to store some values in the key-value store using the example mentioned in the docs.

Config

tables:
  - name: User
    rules:
      phone:
        template:
          format: "{{ _1 }}{{ store_write key='user_names.' ~ prev.profile, value=_1 }}"
          rules:
            - phone:
                format: "+############"
...........

Output

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error { kind: Msg("Failed to parse 'TemplateTransformerTemplate'"), source: Some(Error { kind: Msg(" --> 1:24\n  |\n1 | {{ _1 }}{{ store_write key='user_names.' ~ prev.profile, value=_1 }}\n  |  ^---\n  |\n  = expected `or`, `and`, `not`, `<=`, `>=`, `<`, `>`, `==`, `!=`, `+`, `-`, `*`, `/`, `%`, a filter, or a variable end (`}}`)"), source: None }) }', datanymizer_engine/src/transformers/template/mod.rs:64:59
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
/bin/bash: line 54:    28 Aborted                 ./bin/pg_datanymizer -f backup.sql -c anonymizer/config.yml $SOURCE_POSTGRES_CONNECTION_STRING --accept_invalid_certs=true
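
For comparison, the key-value store issue earlier on this page calls store_write as a function, with parentheses around its arguments; a sketch of the same rule written in that call syntax (assuming this is what the Tera parser expects here):

tables:
  - name: User
    rules:
      phone:
        template:
          # store_write invoked as a function call, as in the earlier issue
          format: "{{ _1 }}{{ store_write(key='user_names.' ~ prev.profile, value=_1) }}"
          rules:
            - phone:
                format: "+############"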

Adding if-conditions for column transformation

It would be good to be able to skip transformations using if-conditions, for example:

phone:
  rules:
    conditions:
      - if: "{% condition statement-1 %}"
        rules:
           - format:
               template: "bla-bla-bla"
      - if: "{% condition statement-2 %}"
        rules:
           - format:
               template: "my cool template"
      - rules:
           - format:
               template: "bla-bla-bla"

Fetch tables metadata takes too long

The "Fetch tables metadata..." step takes more time than a full database dump.

I have filters in place for only the 4 tables I need out of 100+, but this step still takes longer than I expect.

Sharing information between tables

As mentioned in the README, there is a way to share information between rows. I was wondering if it is possible to share information between tables as well? My use case needs to keep a value consistent in table1 and table2.

For example:
I have two tables named Table1 and Table2. Both of them have an emailId field, and I want to anonymize them while keeping the values consistent. Is that possible? (See the sketch below.)
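
A sketch of one possible approach, based on the store_write / store_read templates used in other issues on this page (it assumes both tables share a common key such as the row id; the emails. key prefix and the nested email rule are illustrative):

tables:
  - name: Table1
    rules:
      emailId:
        template:
          # generate a fake email and remember it under the row id
          format: "{{ _1 }}{{ store_write(key='emails.' ~ prev.id, value=_1) }}"
          rules:
            - email: {}
  - name: Table2
    rules:
      emailId:
        template:
          # reuse the value stored for the same id
          format: "{{ store_read(key='emails.' ~ prev.id) }}"

As the execution-order issue at the top of this page notes, this only helps if the table that writes to the store is processed before the table that reads from it.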

Dump includes an extra value per row for COPY on a table I have configured

It appears to affect fields that I have configuration for:

tables:
  - name: user
    rules:
      name:
        person_name: {}
      birthdate:
        datetime:
          from: 1980-01-01T00:00:00+00:00
          to: 2021-07-24T00:00:00+00:00

When loading the dump I see this error:

COPY 0
ERROR:  invalid input syntax for type timestamp with time zone: "Vanessa Wyman"
CONTEXT:  COPY user, line 1, column birthdate: "Vanessa Wyman"
COPY public.user("id", "firebase_uid", "created_at", "updated_at", "username", "bio", "verified", "flagged", "private_account", "avatar_image_id", "user_role_type", "avatar_foreground_color", "avatar_background_light_color", "avatar_background_dark_color", "name", "birthdate", "last_seen", "google_iap_uid") FROM STDIN;
2afa72c2-f06a-4bcd-88e8-6bac7b32fa80	ANDROID- 17	2021-06-14 19:09:17.508424+00	2021-06-14 19:09:17.508424+00	somename	\N	f	f	f	\N	USER	\N	\N	\N	\N	Vanessa Wyman	2004-01-13T00:14:00+00:00	\N

datetime format issues

      birth_month: 
        datetime:
          format: '%m'

throws the error Error: Failed transform trailing input

      birth_month: 
        datetime:
          format: %m

throws the error Error: unexpected character: '%' at line 68 column 19 in config.yaml
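
The second error comes from YAML itself rather than from datanymizer: a plain (unquoted) scalar cannot start with the % indicator character, so the format value has to be quoted, as in the first snippet:

      birth_month:
        datetime:
          format: '%m'   # quoted, because a bare % cannot start a plain YAML scalar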

SSL support

I am trying to use datanymizer in an environment that requires an SSL connection (Heroku Postgres). It fails with an error:

$ docker run --rm -v /Users/f213/prj/education/dev-db:/app -w /app datanymizer/pg_datanymizer "postgres://user:pwd@<REDACTED>.eu-west-1.compute.amazonaws.com:5432/<REDACTED>?sslmode=require"
Error: error performing TLS handshake: no TLS implementation configured

Caused by:
    no TLS implementation configured

Any chance this great project will support SSL?

RUSTSEC-2020-0146: arr! macro erases lifetimes

arr! macro erases lifetimes

Details:
Package: generic-array
Version: 0.12.3
URL: fizyk20/generic-array#98
Date: 2020-04-09
Patched versions: >=0.14.0
Unaffected versions: <0.8.0

Affected versions of this crate allowed unsoundly extending
lifetimes using arr! macro. This may result in a variety of
memory corruption scenarios, most likely use-after-free.

See advisory page for additional details.

Set up CI

Set up GitHub Actions to run these commands:

  • check
  • fmt
  • build
  • clippy
  • test
  • audit

Possible to exclude owner/privilege information?

Using pg_dump and pg_restore, you can exclude owner/privilege information with --no-owner and --no-privileges. Doing restores with psql < dump, as documented, doesn't allow this kind of thing. Is this possible without having to manually post-process the dump file?
