statisticsnorway / dapla-toolbelt-pseudo
Pseudonymization extensions for Dapla Toolbelt
License: MIT License
I have a dataset with a single column, "fnr", which is a sid field.
dpp.pseudonymize(data, sid_fields=["fnr"]).json()
Returns
TypeError: pseudonymize() missing 1 required positional argument: 'fields'
Whenever I specify "sid_fields", I get a 500 server error.
dpp.pseudonymize(data, fields=["fnr"], sid_fields=["fnr"])
HTTPError: 500 Server Error: Internal Server Error for url: http://dapla-pseudo-service.dapla.svc.cluster.local/pseudonymize/file
If I only specify "fnr" as a "field", I get the wrong encryption. The goal is to get FPE on the "fnr" column.
Currently, depseudonymize returns
TypeError: expected str, bytes or os.PathLike object, not DataFrame
The dapla-pseudo-service API supports compression, ref:
https://dapla-pseudo-service.staging-bip-app.ssb.no/api-docs/redoc
Users should be able to specify compression accordingly for all three endpoints:
The pseudo-service previously returned two metadata entries when using map-sid. This behaviour has been changed on the server side, so dapla-toolbelt-pseudo no longer needs to handle it.
If the REST API returns a non-200 status, we could do a better job of presenting these errors to the user.
We should also explicitly distinguish errors caused by wrong user input (HTTP status 4xx) from unexpected errors that occur on the server side (HTTP status 5xx).
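One way to make the 4xx/5xx distinction explicit is to map the status code to separate exception types. The following is a minimal sketch, not the library's actual error handling: the exception names and the helper `raise_for_pseudo_status` are hypothetical, and the fallback messages are only illustrative.

```python
class PseudoServiceError(Exception):
    """Hypothetical base class for errors reported by the pseudo service."""

    def __init__(self, status_code: int, detail: str) -> None:
        super().__init__(f"HTTP {status_code}: {detail}")
        self.status_code = status_code
        self.detail = detail


class PseudoUserInputError(PseudoServiceError):
    """Hypothetical error for 4xx responses: the request itself was wrong."""


class PseudoServerError(PseudoServiceError):
    """Hypothetical error for 5xx responses: something failed on the server."""


def raise_for_pseudo_status(status_code: int, detail: str = "") -> None:
    """Raise a descriptive exception for 4xx and 5xx codes, otherwise return."""
    if 400 <= status_code < 500:
        raise PseudoUserInputError(
            status_code,
            detail or "the request was rejected; check field names and arguments",
        )
    if 500 <= status_code < 600:
        raise PseudoServerError(
            status_code,
            detail or "unexpected error in the pseudo service; try again or report it",
        )
```

With a `requests` response object, this could be called as `raise_for_pseudo_status(response.status_code, response.text)` before the body is parsed, so that users see whether they need to fix their input or report a server problem.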
Parquet files may be partitioned for efficiency. When this happens, the dataset is saved as multiple .parquet files in a directory. To open the dataset, one refers to the path of the directory. This isn't currently supported by dapla-toolbelt-pseudo, as we expect a file extension.
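A possible first step towards supporting partitioned datasets is to resolve the given path to the list of .parquet files it contains. The sketch below uses only the standard library; the function name `find_parquet_files` is hypothetical, and it assumes a local filesystem path (a bucket path would need the equivalent filesystem calls).

```python
from pathlib import Path


def find_parquet_files(dataset_path: str) -> list[Path]:
    """Return the .parquet files behind a path (hypothetical helper).

    A directory is treated as a partitioned dataset and searched recursively;
    a single file is accepted only if it has the .parquet extension.
    """
    path = Path(dataset_path)
    if path.is_dir():
        # Partition directories may be nested (e.g. year=2023/month=01/...).
        return sorted(p for p in path.rglob("*.parquet") if p.is_file())
    if path.suffix == ".parquet":
        return [path]
    raise ValueError(f"Not a parquet file or dataset directory: {dataset_path}")
```

Each returned file could then be handled the same way a single-file dataset is handled today.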
This is not currently possible, and is forcing me into what I would consider an anti-pattern.
I'm using a single process_source-script for a load of files, sometimes they:
The anti-pattern consists of having three different code blocks that use PseudoData chains depending on what types of columns the dataset contains. See the code below. It should be possible to send an empty list or similar to .on_fields(), and then for dapla-toolbelt-pseudo to not send anything for pseudonymization and just pass the object along. If that were possible, I could reduce this to a single pseudo block, which would be very nice.
if fnr_cols and non_map_cols:
    logging.info("Want to pseudo and map to snr: %s - output path: %s",
                 str(fnr_cols), str(output_path))
    logging.info("Want to pseudo, but not map to snr: %s - output path: %s",
                 str(non_map_cols), str(output_path))
    df = (PseudoData.from_pandas(df)
          .on_fields(*renames_postfix.keys())
          .with_stable_id()
          .on_fields(*non_map_cols)
          .with_papis_compatible_encryption()
          .pseudonymize()
          .to_pandas()
          )
elif fnr_cols:
    logging.info("Want to pseudo and map to snr: %s\nOutput path: %s",
                 str(fnr_cols), str(output_path))
    df = (PseudoData.from_pandas(df)
          .on_fields(*renames_postfix.keys())
          .with_stable_id()
          .pseudonymize()
          .to_pandas()
          )
elif non_map_cols:
    logging.info("Want to pseudo, but not map to snr: %s\nOutput path: %s",
                 str(non_map_cols), str(output_path))
    df = (PseudoData.from_pandas(df)
          .on_fields(*non_map_cols)
          .with_papis_compatible_encryption()
          .pseudonymize()
          .to_pandas()
          )
else:  # No columns to pseudo
    return df
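Until empty field lists are accepted by the library, the three branches above could be collapsed on the caller's side with a small helper that skips empty column lists. This is only a sketch: `apply_pseudo_steps` is a hypothetical name, and it assumes the builder chains exactly as in the snippet above, that is, `.on_fields(*cols)` followed by a no-argument method such as `.with_stable_id()` returns an object that accepts another `.on_fields(...)`.

```python
def apply_pseudo_steps(builder, steps):
    """Chain on_fields(...).<method>() for every step with non-empty columns.

    `steps` is an iterable of (columns, method_name) pairs. Returns the final
    builder and a flag telling whether any step was applied, so the caller can
    skip pseudonymization entirely when nothing was selected.
    """
    applied = False
    for columns, method_name in steps:
        if not columns:
            continue  # Nothing to pseudonymize for this rule; skip it.
        builder = getattr(builder.on_fields(*columns), method_name)()
        applied = True
    return builder, applied
```

With the names from the snippet above, the caller would pass `PseudoData.from_pandas(df)` as the builder and `[(list(renames_postfix.keys()), "with_stable_id"), (non_map_cols, "with_papis_compatible_encryption")]` as the steps, then call `.pseudonymize().to_pandas()` on the result only when the returned flag is true, and otherwise return `df` unchanged.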