Name: Stefan Grafberger
Type: User
Company: University of Amsterdam
Bio: I am a Ph.D. student at BIFOLD & TU Berlin, conducting research at the intersection of data management and machine learning.
Twitter: SGrafberger
Location: Amsterdam
Blog: https://stefan-grafberger.com
Stefan Grafberger's Projects
🔎 Finds fuzzy matches between CSV spreadsheets
Imputation of missing values in tables.
🆔 A Python library for accurate and scalable fuzzy matching, record deduplication and entity resolution.
Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Rust Rust Rust!
Jenga is an experimentation library that allows data science practitioners and researchers to study the effect of common data corruptions (e.g., missing values, broken character encodings) on the prediction quality of their ML models.
Action for compiling LaTeX with make
Code and workloads from the Learned Cardinalities paper (https://arxiv.org/abs/1809.00677)
Some datasets for ML pipelines that I want to use for some experiments
Inspect ML Pipelines in Python in the form of a DAG
Inspect ML Pipelines in Python in the form of a DAG (CIDR Submission version)
The files for an initial exploratory user study, which provides the foundation for a larger user study in future work.
Data-Centric What-If Analysis for Native Machine Learning Pipelines
Supporting infrastructure to run scientific experiments without a scientific workflow management system.
Probabilistic Gradient Boosting Machines
A fork to add dagre layout support
StreamDQ is a library built on top of Apache Flink for defining "unit tests for data", which measure data quality in large data streams.