machbarmacher / gdpr-dump
A drop-in replacement for mysqldump that optionally sanitizes DB fields for better GDPR conformity.
This issue is a reminder/placeholder for future work.
Presently, we need to work with the dev-master version of mysqldump-php in order to incorporate one of the override hooks.
As soon as version 2.4 (or greater) is released, we need to pin the composer.json reference to it.
Hi all,
When plugging this into drush sql-dump, I saw that we need the ignore-table option (see https://dev.mysql.com/doc/refman/5.7/en/mysqldump.html#option_mysqldump_ignore-table).
Currently we're following the mysqldump-php convention of "exclude-tables", but that doesn't seem to be a mysqldump option, so I suggest we swap out one with the other.
Something that will definitely be useful for our (read Amazee(Labs/io)'s) use case is adding extra MySQL option files.
The problem with the existing my.cnf files (and paths) is that it's impossible to tell whether there is a gdpr-specific file. Here I'm thinking of cases where the setup of gdpr-dump is separate from the setup of mysql and mysqldump itself.
For instance, if we have a module that exports the gdpr-expressions data, it may be useful to mark it as something other than .my.cnf (and related files).
My suggestion, to begin, is to add the environment variable GDPR_DUMP_HOME where we would check for gdpr.cnf and .gdpr.cnf files, as well as testing for the existence of a .gdpr.cnf file in the MYSQL_HOME directory.
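As a rough illustration of the lookup proposed above, here is a minimal sketch in Python (not the project's PHP code). The function names and the lookup order are assumptions; only the GDPR_DUMP_HOME and MYSQL_HOME variables and the gdpr.cnf / .gdpr.cnf file names come from the suggestion.

```python
import os
from pathlib import Path


def candidate_config_paths(env):
    """Return the gdpr config files to look for, in an assumed lookup order."""
    paths = []
    gdpr_home = env.get("GDPR_DUMP_HOME")
    if gdpr_home:
        # Both file names are checked inside GDPR_DUMP_HOME.
        paths.append(Path(gdpr_home) / "gdpr.cnf")
        paths.append(Path(gdpr_home) / ".gdpr.cnf")
    mysql_home = env.get("MYSQL_HOME")
    if mysql_home:
        # Only the hidden file is checked inside MYSQL_HOME.
        paths.append(Path(mysql_home) / ".gdpr.cnf")
    return paths


def existing_config_files(env=None):
    """Keep only the candidates that actually exist on disk."""
    env = os.environ if env is None else env
    return [p for p in candidate_config_paths(env) if p.is_file()]
```

Keeping the candidate list separate from the existence check makes the lookup order easy to test without touching the filesystem.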
One of the major issues I'm having evaluating PRs is the lack of a test framework. Since we're trying to move a POC that was built pretty quickly, we're severely lacking testing.
I think that this should be addressed as a matter of immediate and serious concern.
I'd love to bounce some ideas off anyone out there about this.
Essentially we not only need to think about unit testing the various bits, but also some way of easily testing that the DB dumps themselves are applying (and keep applying) the appropriate transformations after we've incorporated new code.
Please, if anyone has any suggestions here, I'd really love to hear them.
From my side, I'm going to be working on a minimal set of containerized tests to try to get us going, but I'm open to any suggestions (and PRs).
When installing with Composer 2 you get this warning:
Package fzaninotto/faker is abandoned, you should avoid using it. No replacement was suggested.
Replace abandoned package fzaninotto/faker with fakerphp/faker
Hi @axel-rutz (cc @fjgarlin)
Is there any particular license you wanted to use for this project?
Are you happy for me to add, say, MIT?
Any other preference?
I suggest accepting a series of formatters that might or might not be implemented via Faker. Faker is still the way to go but might not always be suitable. We'd just need to create a map from each formatter key to the code that produces its value.
This will also simplify the gdpr replacements expression. See below.
Suggested gdpr-replacements:
{"tableName":{"columnName1":{"formatter":"username"},"columnName2":{"formatter":"password"},"columnName3":{"formatter":"email"}}}
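To make the key-to-formatter map concrete, here is a minimal sketch in Python (not the project's PHP code). The `FORMATTERS` registry, the `sanitize_row` helper and the generated values are hypothetical; only the JSON shape and the formatter names `username`, `password` and `email` come from the suggestion above.

```python
import json
import secrets

# Hypothetical registry: formatter key -> function producing a replacement value.
# An entry could delegate to Faker or to any other implementation.
FORMATTERS = {
    "username": lambda: "user_" + secrets.token_hex(4),
    "password": lambda: secrets.token_hex(16),
    "email": lambda: secrets.token_hex(4) + "@example.com",
}


def sanitize_row(table, row, replacements):
    """Return a copy of row with the configured columns replaced."""
    clean = dict(row)
    for column, spec in replacements.get(table, {}).items():
        if column in clean:
            # An unknown formatter key raises KeyError here.
            clean[column] = FORMATTERS[spec["formatter"]]()
    return clean


replacements = json.loads(
    '{"tableName":{"columnName1":{"formatter":"username"},'
    '"columnName2":{"formatter":"password"},'
    '"columnName3":{"formatter":"email"}}}'
)
row = {"id": 1, "columnName1": "alice", "columnName2": "hunter2", "columnName3": "a@b.c"}
print(sanitize_row("tableName", row, replacements))
```

Because the configuration only names a formatter key, the implementation behind each key can change without touching the configuration files.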
We want to make the column transformation process as loose as possible, therefore moving away from the explicit coupling we have at the moment.
Currently gdpr-expressions is supported as a command-line argument, but gdpr-replacements is not (although it's a valid argument in .cnf files).
We need to add it.
It would be useful to output which sanitization details are being read from the configuration files, that is, to display the transformation mappings that would be used if the dump were run.
Since we read configuration information from several places, this would be useful in terms of displaying exactly what we can expect on run.
When using original mysqldump, I can write
mysqldump -u username -p db_name
And mysqldump will prompt for a password.
With gdpr-dump's mysqldump replacement, this fails: it doesn't recognize that "db_name" is not a password and instead complains "Not enough arguments (missing: "db-name")." If I use the format
mysqldump db_name -u username -p
It tries to access the database without a password and doesn't prompt for one.
I just don't like typing my password on the command line, which is why I always use mysqldump as shown above.
Could you make it behave like the original mysqldump: if -p is given but db-name appears to be missing, treat the last parameter as the db-name and prompt for a password?
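The requested behaviour can be sketched as follows, in Python rather than the project's PHP. `parse_cli` is a hypothetical helper that handles only `-u` and `-p`; it mirrors the mysqldump convention that a bare `-p` means "prompt for the password", while a password given on the command line must be attached directly (`-pSECRET`).

```python
import getpass


def parse_cli(argv, prompt=getpass.getpass):
    """Parse a small subset of mysqldump-style arguments (-u, -p, db name)."""
    user = None
    password = None
    ask_password = False
    positionals = []
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg == "-u":
            if i + 1 >= len(argv):
                raise ValueError("missing value for -u")
            user = argv[i + 1]
            i += 2
            continue
        if arg == "-p":
            # Bare -p never consumes the next argument; it requests a prompt.
            ask_password = True
        elif arg.startswith("-p"):
            # Attached form: -pSECRET
            password = arg[2:]
        elif arg.startswith("-"):
            raise ValueError("unsupported option in this sketch: " + arg)
        else:
            positionals.append(arg)
        i += 1
    if not positionals:
        raise ValueError('Not enough arguments (missing: "db-name")')
    if ask_password:
        password = prompt("Enter password: ")
    return {"user": user, "password": password, "db_name": positionals[0]}
```

With this parsing, both `-u username -p db_name` and `db_name -u username -p` yield the same result and prompt for the password.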
Hi @axel-rutz
Thanks a bunch for pushing this. I think it's a super exciting approach that's able to be used in several places.
Second, I've been thinking about the approach and one of the things I was thinking is that perhaps we might want to integrate Faker into the process, rather than the expressions based approach (or, better, the two might live side by side).
I've been doing some work on my side, and while it's a total WIP at the moment (in particular, I'm really uncomfortable with the way I've abstracted the expression stuff, and will rewrite ASAP), I thought that since I've got something this would be a good opportunity as any to get talking about it.
So what I've done for now is changed up the --gdpr-expressions switch a little to accept something like this
--gdpr-expressions=\'{"fakertest":{"name":{"transformer":"faker","formatter":"name"}, "telephone":{"transformer":"faker","formatter":"phoneNumber"}}}\'
Which marks particular columns as engaging Faker and which formatter will be used for output. Obviously this could be expanded to include Faker arguments etc. if we wanted them.
If an object isn't passed, it interprets it as a DB expression and uses your current approach.
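The dispatch described above (an object selects a transformer, anything else is treated as a DB expression) could look roughly like this. It is a Python sketch, not the project's PHP code; `plan_columns` and the tuple layout are hypothetical, and the string expression `"uid"` on the `mail` column is added here purely for illustration.

```python
import json


def plan_columns(expressions):
    """Classify each configured column as a transformer call or a DB expression."""
    plan = {}
    for table, columns in expressions.items():
        for column, spec in columns.items():
            if isinstance(spec, dict):
                # Object form: {"transformer": ..., "formatter": ...}
                plan[(table, column)] = ("transformer", spec["transformer"], spec["formatter"])
            else:
                # Anything else is passed through as a DB expression.
                plan[(table, column)] = ("expression", spec)
    return plan


expressions = json.loads(
    '{"fakertest":{"name":{"transformer":"faker","formatter":"name"},'
    '"telephone":{"transformer":"faker","formatter":"phoneNumber"},'
    '"mail":"uid"}}'
)
for key, action in plan_columns(expressions).items():
    print(key, action)
```

Classifying the columns up front keeps object specs out of the SQL field list, so only plain strings are ever used as expressions.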
I'd love to get a discussion started about this.
It seems as though this would be useful to distribute as a single .phar. Thoughts?
mysqldump seems to add DROP TABLE statements; gdpr-dump should be a drop-in replacement, but it doesn't add them.
Shall we change this? Would be a major version bump though ..
Exporting data, even without any parameters, breaks the SQL dump. The table structure is output, but when the data should be returned, the process stops without a useful error message.
KEY `user_field__created` (`created`),
KEY `user_field__access` (`access`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COMMENT='The data table for user entities.';
/*!40101 SET character_set_client = @saved_cs_client */;
--
-- Dumping data for table `users_field_data`
--
LOCK TABLES `users_field_data` WRITE;
/*!40000 ALTER TABLE `users_field_data` DISABLE KEYS */;
SET autocommit=0;
[error] Database dump failed
[error] Unable to dump database. Rerun with --debug to see any error message.
Adding --debug does not reveal anything.
I found that it breaks here:
$columnTypes = $this->tableColumnTypes()[$tableName];
This is because ifsnop/mysqldump-php#141 was never accepted, so the new solution using hooks was not implemented. Also, I'm in a situation where I can't edit the project's JSON file to add patches, and just using composer require does not apply patches (137, to be specific).
My proposal is to add a check on $this->gdprExpressions[$tableName] before invoking $this->tableColumnTypes()[$tableName]. See PR here: #27
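The idea behind the proposed guard, sketched in Python rather than the project's PHP (the function and parameter names are hypothetical, and this is not the code from the PR): skip the column-type lookup entirely for tables that have no expressions configured.

```python
def column_types_for(table_name, gdpr_expressions, table_column_types):
    """Look up column types only for tables that have expressions configured."""
    if not gdpr_expressions.get(table_name):
        # Nothing to sanitize for this table, so the lookup that fails
        # on an unpatched mysqldump-php is never reached.
        return None
    return table_column_types()[table_name]
```

Tables without configured expressions are then dumped unchanged, and the lookup only runs where a transformation is actually needed.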
The composer.json specifies dev-master for mysqldump-php; however, this program is currently not compatible with versions above 2.7. Until support for newer versions is added, the composer.json should specify:
"ifsnop/mysqldump-php": "2.7"
Is there any support for truncating a table?
For example, I don't want to sanitize the data in my webform submissions table, I want to truncate it.
So basically I just want the structure of this table to be exported.
When dumping a database we might not want to copy across or sanitize all tables (e.g. cache tables). I suggest accepting this parameter (on the command line and/or in the configuration files) so that those tables are created but no data is transferred.
Suggested:
"gdpr-skip-tables":["tableName1","tableName2","tableName3"]
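A minimal sketch of how such a list could drive the dump, written in Python rather than the project's PHP. `dump_sections` is a hypothetical helper: every table gets its structure, and only tables not listed in gdpr-skip-tables get their data.

```python
def dump_sections(tables, skip_tables):
    """Return (table, section) pairs: structure always, data unless skipped."""
    skip = set(skip_tables)
    sections = []
    for table in tables:
        sections.append((table, "structure"))
        if table not in skip:
            sections.append((table, "data"))
    return sections


print(dump_sections(["users_field_data", "cache_render"], ["cache_render"]))
```

This also covers the truncation request: a skipped table still exists after import, just without rows.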
When using drush and a Faker formatter, I get the following error:
SQLSTATE[42S22]: Column not found: 1054 Unknown column 'Array' in 'field list'
This is my drush command:
drush sql-dump --tables-list=users_field_data --extra-dump=$'--gdpr-expressions='{"users_field_data":{"name":"uid","mail":"uid","init":"uid","pass":{"formatter":"clear"}}}''
Am I doing something wrong or is this a bug?