datanymizer / datanymizer
Powerful database anonymizer with flexible rules. Written in Rust.
Home Page: https://datanymizer.github.io/docs/
License: MIT License
Add a dumper for MySQL (an analogue of datanymizer_dumper::postgres).
Is there a way to define an execution order for the tables? The values are not written to the store before the read operation starts. Swapping the table order in the config works, but it is inconsistent and fails on some runs.
tables:
  - name: User
    ...
    template:
      format: "{{ store_read(...) }}"
    ...
  - name: Profile
    ...
    template:
      format: "{{ store_write(...) }}"
    ...
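For illustration only, one conceivable shape for such an option is sketched here; table_order is a hypothetical key, not a documented one:
table_order:
  # hypothetical: dump Profile (which writes to the store) before User (which reads from it)
  - Profile
  - User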
I ran the demo as described in demo/README.md.
I built a container with pg_datanymizer inside and ran it via make dump.
But I get an error:
Error: invalid type: unit value, expected struct FirstNameTransformer
If I comment out the first_name block in dvdrental.yml, like this:
filter:
  only:
    - public.actor
tables:
  - name: actor
    rules:
      # first_name:
      #   first_name: ~
      last_name:
        last_name: ~
      last_update:
        datetime:
          from: 1990-01-01T00:00:00+00:00
          to: 2010-12-31T00:00:00+00:00
Then I get the same error with last_name:
Error: invalid type: unit value, expected struct LastNameTransformer
After also commenting out last_name, like this:
filter:
  only:
    - public.actor
tables:
  - name: actor
    rules:
      # first_name:
      #   first_name: ~
      # last_name:
      #   last_name: ~
      last_update:
        datetime:
          from: 1990-01-01T00:00:00+00:00
          to: 2010-12-31T00:00:00+00:00
Everything works fine.
When I build a container with PostgreSQL and download the pg_datanymizer CLI into it, everything works fine!
My Dockerfile, for example: Link
Built container on Docker Hub: agmayer/pg_datanymizer
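A guess at the cause, based on the error text: first_name: ~ parses as YAML null, which serde reports as a "unit value", while the transformer config expects a struct. If that is right, an empty mapping instead of ~ might deserialize cleanly; an untested sketch:
rules:
  first_name:
    first_name: {} # guess: empty mapping instead of `~` (YAML null / serde unit value)
  last_name:
    last_name: {}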
Hi! Is it possible to change data in place (without a dump-restore cycle)?
We already have an automated backups-to-staging restoration cycle (Postgres, basebackups) and would like to use this tool for in-place data masking.
Generate fake company names, based on the CompanyName faker from fake-rs.
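A sketch of what the config might look like once such a transformer exists; the company_name key is an assumption here, mirroring how other fake-rs-backed transformers are configured elsewhere in this document:
tables:
  - name: organizations
    rules:
      name:
        company_name: {} # assumed transformer name, backed by the fake-rs CompanyName faker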
Currently, uniq: true is available for the phone transformer only. Make it available for others (email, ip, random_num).
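Proposed usage, extending the uniq: true flag that the phone transformer already supports; the email form below is the proposal, not current behavior:
rules:
  email_field:
    email:
      uniq: true # proposed: regenerate until the value is unique, as `phone` does today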
Hi @evgeniy-r, I am facing some issues while using the key-value store; please correct me if I am using it wrong.
User is connected to Profile with a foreign key in the User table. So I am using profile_id while storing the data to the store, and then while retrieving I am just using the id from the Profile table.
tables:
  - name: User
    rules:
      phone:
        template:
          format: "{{ _1 }}{{ store_write(key='user_phonenumbers.' ~ prev.profile_id, value=_1) }}"
          rules:
            - phone:
                format: "+############"
                uniq: true
  - name: Profile
    rules:
      phone:
        # reading phone numbers from `user_phonenumbers` stored in `User`
        template:
          format: "{{ store_read(key='user_phonenumbers.' ~ prev.id) }}"
...
Prepare to dump table: default$default.Profile
Error: Failed transform Failed to render 'TemplateTransformerTemplate'
ERROR: Job failed: exit code 1
FATAL: exit code 1
Just to confirm whether the template gives an error on reading, I tried removing the template under Profile and replacing it with phone to randomize the field, and it works that way. So I am quite sure that something is wrong, either with how I am implementing it or with the anonymizer.
DB:
create table "user"
(
    id serial,
    first_name text,
    last_name text,
    email_field text
);
insert into "user" (first_name, last_name, email_field) VALUES ('A', 'B', '[email protected]');
alter table "user" drop column last_name;
Config:
tables:
  - name: "user"
    rules:
      email_field:
        email:
          kind: Safe
Error:
thread 'main' panicked at 'index out of bounds: the len is 3 but the index is 3', datanymizer_engine/src/engine.rs:40:25
Use the existing database schema, but generate new data (specified number of rows).
We have a table with the following 30 fields:
This fails and datanymizer exits because an internal untransformed query for the dump is repeating the event_type field 3 times in the COPY:
None ret from qw.copy_out(untransformed_query=
"COPY "metrics"."deployment"("deployment_id", "hypervisor_key", "specification_id", "vmware_deployment_target_id", "vcenter_url", "vcenter_cluster", "vcenter_folder", "clone_source_id", "event_type", "event_type", "event_type", "org_name", "org_id", "mto_node_id", "mto_node_path", "mto_node_org_path", "academy_event_id", "package_id", "package_estimated_duration_in_seconds", "course_plan_id", "deployer_user_id", "user_full_name", "user_email", "warden_event_id", "created_at", "last_updated", "project_id", "project_org_id", "project_org_name", "project_mto_node_id", "project_mto_node_path", "project_mto_node_org_path") TO STDOUT "), table=metrics.deployment, count=0, error=Error { kind: Db, cause: Some(DbError { severity: "ERROR", parsed_severity: Some(Error), code: SqlState(E42701), message: "column \"event_type\" specified more than once", detail: None, hint: None, position: None, where_: None, schema: None, table: None, column: None, datatype: None, constraint: None, file: Some("copy.c"), line: Some(5155), routine: Some("CopyGetAttnums") }) }
Working now to understand why and fix it.
Possible syntax (look at tr):
tables:
  - name: users
    rules:
      login:
        username: {}
      preferences:
        template:
          format: '{"some_complex_json": {"field": "{{tr.login}}"}}'
Tera also allows tr["login"].
The name tr means "transformed row". We could use tr_row or transformed_row instead (though that might be too long).
We could also use just row, but what if in the future we decide to expose the values of the original row too?
Originally posted by marcelofernandez October 5, 2021
Seems like the demo docker environment isn't working:
marcelo@marcelo:~/dev/datanymizer/demo/$ make bootstrap
[...]
Compiling nom v5.1.2
Compiling openssl-sys v0.9.63
error: failed to run custom build command for `openssl-sys v0.9.63`
Caused by:
process didn't exit successfully: `/usr/src/target/release/build/openssl-sys-9793598f3bd829c9/build-script-main` (exit code: 101)
--- stdout
cargo:rustc-cfg=const_fn
cargo:rerun-if-env-changed=X86_64_UNKNOWN_LINUX_MUSL_OPENSSL_LIB_DIR
X86_64_UNKNOWN_LINUX_MUSL_OPENSSL_LIB_DIR unset
cargo:rerun-if-env-changed=OPENSSL_LIB_DIR
OPENSSL_LIB_DIR unset
cargo:rerun-if-env-changed=X86_64_UNKNOWN_LINUX_MUSL_OPENSSL_INCLUDE_DIR
X86_64_UNKNOWN_LINUX_MUSL_OPENSSL_INCLUDE_DIR unset
cargo:rerun-if-env-changed=OPENSSL_INCLUDE_DIR
OPENSSL_INCLUDE_DIR unset
cargo:rerun-if-env-changed=X86_64_UNKNOWN_LINUX_MUSL_OPENSSL_DIR
X86_64_UNKNOWN_LINUX_MUSL_OPENSSL_DIR unset
cargo:rerun-if-env-changed=OPENSSL_DIR
OPENSSL_DIR unset
cargo:rerun-if-env-changed=OPENSSL_NO_PKG_CONFIG
cargo:rerun-if-env-changed=PKG_CONFIG_ALLOW_CROSS_x86_64-unknown-linux-musl
cargo:rerun-if-env-changed=PKG_CONFIG_ALLOW_CROSS_x86_64_unknown_linux_musl
cargo:rerun-if-env-changed=TARGET_PKG_CONFIG_ALLOW_CROSS
cargo:rerun-if-env-changed=PKG_CONFIG_ALLOW_CROSS
cargo:rerun-if-env-changed=PKG_CONFIG_x86_64-unknown-linux-musl
cargo:rerun-if-env-changed=PKG_CONFIG_x86_64_unknown_linux_musl
cargo:rerun-if-env-changed=TARGET_PKG_CONFIG
cargo:rerun-if-env-changed=PKG_CONFIG
cargo:rerun-if-env-changed=PKG_CONFIG_SYSROOT_DIR_x86_64-unknown-linux-musl
cargo:rerun-if-env-changed=PKG_CONFIG_SYSROOT_DIR_x86_64_unknown_linux_musl
cargo:rerun-if-env-changed=TARGET_PKG_CONFIG_SYSROOT_DIR
cargo:rerun-if-env-changed=PKG_CONFIG_SYSROOT_DIR
run pkg_config fail: "pkg-config has not been configured to support cross-compilation.\n\n Install a sysroot for the target platform and configure it via\n PKG_CONFIG_SYSROOT_DIR and PKG_CONFIG_PATH, or install a\n cross-compiling wrapper for pkg-config and set it via\n PKG_CONFIG environment variable."
--- stderr
thread 'main' panicked at '
Could not find directory of OpenSSL installation, and this `-sys` crate cannot
proceed without this knowledge. If OpenSSL is installed and this crate had
trouble finding it, you can set the `OPENSSL_DIR` environment variable for the
compilation process.
Make sure you also have the development packages of openssl installed.
For example, `libssl-dev` on Ubuntu or `openssl-devel` on Fedora.
If you're in a situation where you think the directory *should* be found
automatically, please open a bug at https://github.com/sfackler/rust-openssl
and include information about your system as well as this message.
$HOST = x86_64-unknown-linux-gnu
$TARGET = x86_64-unknown-linux-musl
openssl-sys = 0.9.63
', /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/openssl-sys-0.9.63/build/find_normal.rs:174:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
warning: build failed, waiting for other jobs to finish...
error: build failed
I'm trying this out in an Ubuntu 20.04 + Docker 20.10.9~3-0~ubuntu-focal setup, on an x86_64 AMD machine...
Folks,
I have a situation where the same user-defined type name is re-used across two different schemas in the same database, and it breaks the process.
I think the query that collects the columns of a table needs a small tweak:
line 32 of schema_inspector.rs needs to change from:
const TABLE_COLUMNS_QUERY: &str = "SELECT cc.column_name, cc.ordinal_position, cc.data_type, pt.oid
FROM information_schema.columns as cc
JOIN pg_catalog.pg_type as pt
ON cc.udt_name = pt.typname
WHERE cc.table_schema = $1 and cc.table_name = $2
ORDER BY cc.ordinal_position ASC";
to:
const TABLE_COLUMNS_QUERY: &str = "SELECT cc.column_name, cc.ordinal_position, cc.data_type, pt.oid
FROM information_schema.columns as cc
JOIN pg_catalog.pg_type as pt
ON cc.udt_name = pt.typname
LEFT JOIN pg_catalog.pg_namespace as pn
ON (
cc.data_type = 'USER-DEFINED'
AND cc.table_schema = pn.nspname
and pn.oid = pt.typnamespace
)
WHERE cc.table_schema = $1 and cc.table_name = $2
AND ( cc.data_type <> 'USER-DEFINED' OR pn.nspname is not null )
ORDER BY cc.ordinal_position ASC";
Mainly, it ensures that the schema name is taken into account when resolving the type of a column, when that type is user-defined.
This is necessary when we have many rows and the faker itself cannot provide enough unique emails (this is a quite typical situation).
email: {}
Sample email: [email protected]

email:
  prefix: 5 # prefix length
Sample email: [email protected]

email:
  postfix: 6 # postfix length
Sample email: [email protected]
Our database uses timescaledb, and there's a schema called _timescaledb_cache with three tables: cache_inval_bgw_job, cache_inval_extension, and cache_inval_hypertable. For some reason there are no fields defined in these tables.
When running pg_datanymizer against this, it fails with an error that made it hard to know which table was the cause (we have many schemas and tables). Running Rust nightly with RUST_BACKTRACE=full allowed me to see the offending line, and I modified pg_datanymizer a bit to print out the offending query.
(prior output deleted)
[Dumping: _timescaledb_internal._hyper_9_327_chunk] [|##################################################|] 0 of 0 rows [100%] (0s)
[Dumping: _timescaledb_internal._hyper_9_327_chunk] Finished in 0 seconds
[Dumping: _timescaledb_internal._hyper_15_976_chunk] [|##################################################|] 68288 of 68288 rows [100%] (0s)
[Dumping: _timescaledb_internal._hyper_15_976_chunk] Finished in 2 seconds
Error: db error: ERROR: syntax error at or near ")"
Caused by:
ERROR: syntax error at or near ")"
Stack backtrace:
0: std::backtrace_rs::backtrace::libunwind::trace
at /rustc/2681f253bcdb31a274411d2be456e7c6a1c67d62/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
1: std::backtrace_rs::backtrace::trace_unsynchronized
at /rustc/2681f253bcdb31a274411d2be456e7c6a1c67d62/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
2: std::backtrace::Backtrace::create
at /rustc/2681f253bcdb31a274411d2be456e7c6a1c67d62/library/std/src/backtrace.rs:328:13
3: std::backtrace::Backtrace::capture
at /rustc/2681f253bcdb31a274411d2be456e7c6a1c67d62/library/std/src/backtrace.rs:296:9
4: anyhow::error::<impl core::convert::From for anyhow::Error>::from
at /Users/beynon/.cargo/registry/src/github.com-1ecc6299db9ec823/anyhow-1.0.33/src/error.rs:464:25
5: <core::result::Result<T,F> as core::ops::try_trait::FromResidual<core::result::Result<core::convert::Infallible,E>>>::from_residual
at /rustc/2681f253bcdb31a274411d2be456e7c6a1c67d62/library/core/src/result.rs:2052:27
6: datanymizer_dumper::postgres::dumper::PgDumper<W,I>::dump_table
at /Users/beynon/work/datanymizer/datanymizer_dumper/src/postgres/dumper.rs:109:26
7: <datanymizer_dumper::postgres::dumper::PgDumper<W,I> as datanymizer_dumper::Dumper>::data
at /Users/beynon/work/datanymizer/datanymizer_dumper/src/postgres/dumper.rs:173:17
8: datanymizer_dumper::Dumper::dump
at /Users/beynon/work/datanymizer/datanymizer_dumper/src/lib.rs:21:9
9: pg_datanymizer::app::App::run
at /Users/beynon/work/datanymizer/cli/pg_datanymizer/src/app.rs:34:31
10: pg_datanymizer::main
at /Users/beynon/work/datanymizer/cli/pg_datanymizer/src/main.rs:13:5
One of the offending queries is:
COPY "_timescaledb_cache"."cache_inval_hypertable"() TO STDOUT
The solution seems to be: if the column list is empty, the parentheses must be omitted:
COPY "_timescaledb_cache"."cache_inval_hypertable" TO STDOUT
I can easily skip these tables with an "except:" in the config.yaml, but then the pg_dump output is missing a table that may still be required by timescaledb to function properly.
Create an example introducing how to get started with the CLI tool pg_datanymizer.
Set up an environment with docker-compose, a test database dump, and an example config file.
When I run pg_datanymizer against a GCP managed SQL instance, I only get a data dump; it does not add any CREATE statements for the relations.
Running pg_dump directly against the same database with the same credentials gives a full dump as expected.
The command I am running is
pg_datanymizer db --config /app/config.yml --host cloud-sql-proxy --username <redacted> --password <redacted> --file /app/dump.sql --port 5432
The config just has table rules, no filters or anything like that.
Would appreciate any guidance on this.
This requires some refactoring of the UniqTransformer trait.
Now we only have a stub:
https://github.com/datanymizer/datanymizer/blob/main/datanymizer_engine/src/locale/ru.rs
Locale implementation example (English):
https://github.com/cksac/fake-rs/blob/master/fake/src/locales/mod.rs
Just clears the field (sets a NULL value).
Currently, I have hundreds of tables in my database, all under different schemas.
I want to anonymize around 5-10 of my tables. I am able to achieve this using the following config:
.....
filter:
  data:
    only:
      - "default$default.Profile"
...
Potentially, this creates another issue, as it copies over the other schemas, which I don't want to copy.
When I try to use the following config:
.....
filter:
  schema:
    only:
      - "default$default.Profile"
...
Only the selected tables get copied, but not their dependencies, e.g. ENUMs used in those tables.
So I would like to request a feature in which the filter accepts a wildcard * for anything under a schema, something like:
filter:
  schema:
    only:
      - "default$default.*"
    except:
      - "public.*"
Thank you in advance.
Possible config syntax:
tables:
  - users:
      # per table
      limit: 10000
# ........................
default:
  # global
  limit: 1000
We can also think about more flexible customization: specifying an SQL query for the table, as sketched below.
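For example, a hypothetical query key per table (the key name is an assumption, just to make the idea concrete):
tables:
  - users:
      # hypothetical: dump only the rows returned by a custom SQL query
      query: "SELECT * FROM users WHERE created_at > now() - interval '1 year'"
      limit: 10000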
Possible config syntax:
tables:
# ..............
default:
  locale: RU
@evgeniy-r I am facing an issue while using the CLI with my DB. I am not sure whether I am using the config correctly; can you help me rectify the issue?
Thank you in advance.
I have a database with multiple schemas, and I want to copy over the schema default$default and the tables under that schema. Of course, I also anonymize certain fields in the tables.
tables:
  - name: default$default.Profile
    rules:
      firstName:
        first_name: {}
...
...
filter:
  schema:
    only:
      - default$default
$ ./bin/pg_datanymizer -f backup.sql -c anonymizer/config.yml --accept_invalid_certs $POSTGRES_CONNECTION_STRING
Prepare data scheme...
pg_dump error. Command:
pg_dump --section pre-data -t default$default <POSTGRES_CONNECTION_STRING>
Output:
pg_dump: error: no matching tables were found
Currently, when a filter is added for the list of tables, it filters only the data copying process. The schema dump needs to respect the filter as well. To do this, we need to add arguments to the invocation of pg_dump.
The pg_dump CLI application is used in two places:
https://github.com/datanymizer/datanymizer/blob/main/datanymizer_dumper/src/postgres/dumper.rs#L62
and
https://github.com/datanymizer/datanymizer/blob/main/datanymizer_dumper/src/postgres/dumper.rs#L134
We need to add inclusion and exclusion support to them.
Set up GitHub Actions to automate the release process.
I am getting a panic when I try to store some values in the key-value store using the example mentioned in the docs.
tables:
  - name: User
    rules:
      phone:
        template:
          format: "{{ _1 }}{{ store_write key='user_names.' ~ prev.profile, value=_1 }}"
          rules:
            - phone:
                format: "+############"
...........
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error { kind: Msg("Failed to parse 'TemplateTransformerTemplate'"), source: Some(Error { kind: Msg(" --> 1:24\n |\n1 | {{ _1 }}{{ store_write key='user_names.' ~ prev.profile, value=_1 }}\n | ^---\n |\n = expected `or`, `and`, `not`, `<=`, `>=`, `<`, `>`, `==`, `!=`, `+`, `-`, `*`, `/`, `%`, a filter, or a variable end (`}}`)"), source: None }) }', datanymizer_engine/src/transformers/template/mod.rs:64:59
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
/bin/bash: line 54: 28 Aborted ./bin/pg_datanymizer -f backup.sql -c anonymizer/config.yml $SOURCE_POSTGRES_CONNECTION_STRING --accept_invalid_certs=true
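Judging by the parse error (Tera stops right after store_write and expects an operator, a filter, or a variable end), the template seems to be missing the function-call parentheses; the working example earlier in this document uses this form:
format: "{{ _1 }}{{ store_write(key='user_names.' ~ prev.profile, value=_1) }}"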
It would be good to skip transformations using if-conditions, for example:
phone:
  rules:
    conditions:
      - if: "{% condition statement-1 %}"
        rules:
          - format:
              template: "bla-bla-bla"
      - if: "{% condition statement-2 %}"
        rules:
          - format:
              template: "my cool template"
      - rules:
          - format:
              template: "bla-bla-bla"
The "Fetch tables metadata..." step takes more time than a full database dump.
I have filters in place for the only 4 tables I need out of 100+, but this step still takes longer than I expect.
As mentioned in the readme, there is a way to share information between rows. I was wondering whether it is also possible to share information between tables. My use case needs to keep a value consistent between table1 and table2.
For example: I have two tables named Table1 and Table2. Both of them have a field emailId, and I want to anonymize them while keeping the values consistent. Is that possible?
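One possible sketch uses the key-value store functions shown earlier in this document; keying by prev.id assumes both tables share an identifier, and it relies on Table1 being dumped before Table2 (see the table-ordering caveat above):
tables:
  - name: Table1
    rules:
      emailId:
        template:
          format: "{{ _1 }}{{ store_write(key='user_emails.' ~ prev.id, value=_1) }}"
          rules:
            - email: {}
  - name: Table2
    rules:
      emailId:
        template:
          format: "{{ store_read(key='user_emails.' ~ prev.id) }}"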
It appears to impact fields that I have configuration for:
tables:
  - name: user
    rules:
      name:
        person_name: {}
      birthdate:
        datetime:
          from: 1980-01-01T00:00:00+00:00
          to: 2021-07-24T00:00:00+00:00
When loading the dump I see this error:
COPY 0
ERROR: invalid input syntax for type timestamp with time zone: "Vanessa Wyman"
CONTEXT: COPY user, line 1, column birthdate: "Vanessa Wyman"
COPY public.user("id", "firebase_uid", "created_at", "updated_at", "username", "bio", "verified", "flagged", "private_account", "avatar_image_id", "user_role_type", "avatar_foreground_color", "avatar_background_light_color", "avatar_background_dark_color", "name", "birthdate", "last_seen", "google_iap_uid") FROM STDIN;
2afa72c2-f06a-4bcd-88e8-6bac7b32fa80 ANDROID- 17 2021-06-14 19:09:17.508424+00 2021-06-14 19:09:17.508424+00 somename \N f f f \N USER \N \N \N \N Vanessa Wyman 2004-01-13T00:14:00+00:00 \N
Compare performance with raw pg_dump, possibly identify bottlenecks.
birth_month:
  datetime:
    format: '%m'
throws the error: Error: Failed transform trailing input
birth_month:
  datetime:
    format: %m
throws the error: Error: unexpected character: '%' at line 68 column 19 in config.yaml
This link is broken, and "make bootstrap / make download" isn't working:
https://github.com/datanymizer/datanymizer/blob/main/demo/Makefile#L1
It should be replaced with this one, which works OK:
https://www.postgresqltutorial.com/wp-content/uploads/2019/05/dvdrental.zip
https://github.com/datanymizer/datanymizer/actions/runs/1432731164
One option is to get rid of chrono and only use time. But chrono supports more date and time formats (when parsing a configuration).
I am trying to use datanymizer in an environment that requires an SSL connection (Heroku Postgres). It fails with an error:
$ docker run --rm -v /Users/f213/prj/education/dev-db:/app -w /app datanymizer/pg_datanymizer "postgres://user:pwd@<REDACTED>.eu-west-1.compute.amazonaws.com:5432/<REDACTED>?sslmode=require"
Error: error performing TLS handshake: no TLS implementation configured
Caused by:
no TLS implementation configured
Any chance this great project will support SSL?
arr! macro erases lifetimes
Package: generic-array
Version: 0.12.3
URL: fizyk20/generic-array#98
Date: 2020-04-09
Patched versions: >=0.14.0
Unaffected versions: <0.8.0
Affected versions of this crate allowed unsoundly extending lifetimes using the arr! macro. This may result in a variety of memory corruption scenarios, most likely use-after-free.
See advisory page for additional details.
Set up GitHub Actions to run these commands:
We could support both the postgresql://... and postgres://... schemes (currently only postgres://... is supported).
Using pg_dump and pg_restore, you can exclude owner/privilege information using --no-owner and --no-privileges. Using psql < dump to do restores, as documented, doesn't allow this kind of thing. Is this possible without having to manually text-process the dump file?