The quipp-collab from alan-turing-institute

Inviting collaborators to this repository

When discussing collaborating with another project with access to medical data, the PI expressed a desire to have visibility into what we are doing. We discussed the option of inviting collaborators to this repo.

Tech report on definitions of utility preservation

Add microsim papers from Dropbox to Zotero

Add papers from the Dropbox folder to Zotero (link sent by Alison on 10th October).

Initial project presentation

Martin gave an initial presentation on the project at the first Turing Health Programme seminar on Wed 14 October. The presentations at this event were all from ASG-funded projects related to the Health theme.

Data science for mental health seminar (30th Jan)

The talk on Thursday 19th December looks relevant: Towards Shareable Data in Clinical Natural Language Processing: Generating Synthetic Electronic Health Records, Julia Ive.

The seminar starts at 3:15pm, and they're usually in Ada-Augusta. Full timings and other details are available here.

CTGAN pipeline for modelling tabular data (continuous and discrete columns)

Relevant issues:
First attempt at a Deep Learning pipeline #23

References for CTGAN
GitHub repo: https://github.com/DAI-Lab/CTGAN
Paper: https://arxiv.org/abs/1907.00503

Abstract (above paper):
Modeling the probability distribution of rows in tabular data and generating realistic synthetic data is a non-trivial task. Tabular data usually contains a mix of discrete and continuous columns. Continuous columns may have multiple modes whereas discrete columns are sometimes imbalanced making the modeling difficult. Existing statistical and deep neural network models fail to properly model this type of data. We design TGAN, which uses a conditional generative adversarial network to address these challenges. To aid in a fair and thorough comparison, we design a benchmark with 7 simulated and 8 real datasets and several Bayesian network baselines. TGAN outperforms Bayesian methods on most of the real datasets whereas other deep learning methods could not.

Born in Bradford cohort

Add Greg to the Zotero group (SyntheticData)

@martintoreilly, as owner I think you need to do this please.

@gmingas has Zotero username 'gmingas'

Tech report on definitions of privacy preservation

US adult census dataset

US adult census dataset, used in the ONS blog post as an example. It can be used for a classification task: does a person earn above or below $50,000, based on their other characteristics?

https://archive.ics.uci.edu/ml/datasets/adult

Set up CI on the pipeline repo

The (minimal so far) report uses GitHub Actions for building the pdf. I would like to suggest we give this a try for our CI as well. Does anyone have any thoughts on this? I haven’t used it other than just now, so would be good to hear of any pitfalls. It's free to use for public repositories.

Australian government data

I had a note about this - does anyone know more?

Microsimulation literature

https://www.dropbox.com/sh/ia02dqg0ql78jd4/AADtzEDWvtA2K1S-vjFUc5kXa?dl=0

Add these to Zotero

GAN libraries

List of potentially interesting libraries to be used in our GAN pipeline:

CTGAN for modeling Tabular data using Conditional GAN.
Collection of PyTorch implementations of Generative Adversarial Network varieties: more than 30 different implementations of GAN varieties presented in research papers.

Tax data

Swedish individual incomes are released in a paper book for each region (the Taxeringskalender), with information provided by the Swedish tax office. Swedish citizens can also query an individual's income online, but a copy of the query is received by the query subject.

The Swedish tax office is keen on improving access to and use of open data and has a contact form for collaborations.

Once we have a web presence (and maybe our first blog post) we should contact them to see if they are interested in collaborating to generate a synthetic income data set. Although they make this data open, it's not available electronically in bulk so I think they could see a benefit to themselves in having a more open synthetic data set. They may also potentially see value in supporting work that may enable tax offices in other countries to release more detailed income data.

I also met James Dainty from HMRC Labs at the Manchester SDAP meeting in December. With the changes in data handling legislation in the Digital Economy Act, HMRC no longer requires all research to be directly improving HMRC operations, so we should also contact James to see if there is a chance of us working with them to generate synthetic UK income data.

Quarterly Labour Force Survey, January - March, 2015: Unrestricted Access Teaching Dataset

https://beta.ukdataservice.ac.uk/datacatalogue/studies/study?id=7912#!/details#administrative

Create Zotero group

Create group in Zotero to collect relevant papers.

Twitter data

API details: https://help.twitter.com/en/rules-and-policies/twitter-api

A nice library to help with accessing the APIs: http://www.tweepy.org/

MoveBank database of animal GPS movements from various sources

https://www.movebank.org/panel_embedded_movebank_webapp

NLP reading group on privacy-preserving data sharing (12th Dec)

Turing Slack, "#nlpreading"

nicole_peinelt 5:11 PM
Dear all,
Here are the details for next week’s reading group:
Paper: Privacy-Preserving Generative Deep Neural Networks Support Clinical Data Sharing (https://www.ahajournals.org/doi/10.1161/CIRCOUTCOMES.118.005122)
Facilitators: Julia Ive and Zixu Wang
Date: Thursday, 21st November 2019
Time: 4-5pm
Location: The Alan Turing Institute, David Blackwell Meeting Room
We kindly ask external participants to let us know a day in advance if you’re planning to join as we need to notify reception ahead of time.
Please also remember to suggest papers and volunteer as presenter in our google doc schedule: https://docs.google.com/document/d/1aWv5JEi-0ehIOFjU0YiI8K1JHcNMYZcHRtfSV0mMYdQ/edit.
Best,
Nicole

Good review of syn data methods

https://www.ijstr.org/final-print/mar2017/A-Review-Of-Synthetic-Data-Generation-Methods-For-Privacy-Preserving-Data-Publishing.pdf

Write down the project roadmap

A common theme in the "What was lacking" section of our recent REG retrospective (notes available in meeting-notes/2019-12-05-reg-retrospective.md) was the lack of a roadmap and the medium/long-term structure of the project.

Several team members have joined only in the last month, so won't have been around for our initial discussions on the scope of the project. We also don't have any proposed timelines etc. in this repository - if anyone has one available, could they please add it?

January would be a good time to assess our goals going forwards. We suggested during the retrospective that one of our regular meetings would be a good time to discuss this further.

First attempt at a multi-imputation based method

Add code for this pipeline to https://github.com/alan-turing-institute/QUIPP-pipeline under methods/LIBRARY_NAME/, and any datasets in datasets.

Apply for Azure credits for Safe Haven

We will need a safe haven environment to work with the real world sensitive datasets we hope to have access to.

See application form on Turing Complete.

A rough estimate of costs can be made using this costing estimator.

Differential privacy methods - investigate and run

NYC Taxi & Limousine Commission trip data

Trip records capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts.

Two datasets available, both covering the period 2009-2018:

Yellow Taxi records: 1.5B rows (50GB) in total as of 2018
For Hire Vehicle (FHV) records: 500M rows (5GB) as of 2018

Note: It looks like pick up and drop off locations are just a pseudonymised location ID and do not come with actual location data

Patient ICU data used by AIDA

find out if we can use it?
~~(- is this the same as Sebastian's?)~~ (No)

This sounds like a good example to start with a cleaned subset, because it is already well understood (by the AIDA team).

Write Synthetic Data section for Privacy Preserving Computation paper

Resources

NHS Information Governance Toolkit - Requirements for Secondary Use Organisations (original | zotero)

First attempt at autoencoder based pipeline

Add code for this pipeline to https://github.com/alan-turing-institute/QUIPP-pipeline under methods/LIBRARY_NAME/, and any datasets in datasets.

MIMIC intensive care dataset

Summary from the website

MIMIC is an openly available dataset developed by the MIT Lab for Computational Physiology, comprising deidentified health data associated with ~60,000 intensive care unit admissions. It includes demographics, vital signs, laboratory tests, medications, and more.

Data table descriptions

First attempt at a microsimulation based pipeline

Add code for this pipeline to https://github.com/alan-turing-institute/QUIPP-pipeline under methods/LIBRARY_NAME/, and any datasets in datasets.

Change the name of the project

Do we want to change the name of the project? It's currently:

Quantifying UncertaInty and Preserving Privacy in Synthetic Data Sets (QUiPP)

I feel that the Privacy Preservation part should be more upfront!

Any suggestions for a new title and acronym before Martin launches the project next week?!

Outcome

Final decision on project name: QUiPP: Quantifying Utility and Preserving Privacy in Synthetic Data Sets

First attempt at a SMOTE based pipeline

SMOTE is a technique initially developed for oversampling the minority class in imbalanced datasets but can be used for data synthesis too. A quick explanation is provided in https://datasciencecampus.ons.gov.uk/projects/synthetic-data-for-public-good/
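As a rough illustration of the idea (not the pipeline's implementation), SMOTE-style synthesis picks a row, picks one of its nearest neighbours, and samples a point on the line segment between them. The sketch below uses only the standard library and assumes purely numeric columns; the function name `smote_like_samples` and the toy rows are hypothetical, and categorical columns would need separate handling.

```python
import math
import random


def smote_like_samples(points, n_new, k=2, seed=0):
    """Generate n_new synthetic rows by SMOTE-style interpolation.

    points: list of equal-length lists of floats (numeric columns only).
    Each synthetic row lies on the segment between a randomly chosen
    base row and one of its k nearest neighbours.
    """
    if len(points) < 2:
        raise ValueError("need at least two points to interpolate")
    rng = random.Random(seed)
    k = min(k, len(points) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(points))
        base = points[i]
        # Indices of all other rows, sorted by Euclidean distance to the base row
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(base, points[j]),
        )
        neighbour = points[rng.choice(others[:k])]
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy numeric rows, made up for illustration
rows = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_rows = smote_like_samples(rows, n_new=5)
```

Because every synthetic row is a convex combination of two real rows, it stays inside the range of the original data, which is worth bearing in mind when assessing privacy.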

Add code for this pipeline to https://github.com/alan-turing-institute/QUIPP-pipeline under methods/LIBRARY_NAME/, and any datasets in datasets.

Cohort of 100 million Brazilians

http://cidacs.bahia.fiocruz.br/?lang=en

Reflections on statistical disclosure control in practice

First attempt at a Deep Learning pipeline

Add code for this pipeline to https://github.com/alan-turing-institute/QUIPP-pipeline under methods/LIBRARY_NAME/, and any datasets in datasets.

US adult census dataset

Link: http://archive.ics.uci.edu/ml/machine-learning-databases/adult/

The dataset has 30,162 records and a binary label indicating a salary of less than or equal to $50,000 or greater than $50,000. Of the records in the dataset, roughly 75% have a class label of less than or equal to $50,000. There are 14 attributes, consisting of eight categorical and six continuous attributes.
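A minimal sketch of how the class balance could be checked once the file is downloaded. It assumes the comma-separated layout of `adult.data`, with the income label in the last field and `?` marking missing values; the helper names and the sample rows are made up for illustration.

```python
from collections import Counter

# Column names as listed in the UCI documentation; the last field is the label
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "income",
]
CONTINUOUS = {
    "age", "fnlwgt", "education-num",
    "capital-gain", "capital-loss", "hours-per-week",
}


def parse_adult(lines):
    """Parse rows, skipping blank lines and rows with missing ('?') values."""
    records = []
    for line in lines:
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != len(COLUMNS) or "?" in fields:
            continue
        record = dict(zip(COLUMNS, fields))
        for name in CONTINUOUS:
            record[name] = int(record[name])
        records.append(record)
    return records


def label_shares(records):
    """Return the fraction of records carrying each income label."""
    counts = Counter(r["income"] for r in records)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Made-up rows in the adult.data layout, for illustration only
sample = [
    "41, Private, 100000, Bachelors, 13, Never-married, Sales, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K",
    "52, Private, 120000, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K",
    "23, ?, 90000, HS-grad, 9, Never-married, ?, Own-child, Black, Male, 0, 0, 20, United-States, <=50K",
    "35, Local-gov, 110000, HS-grad, 9, Divorced, Craft-repair, Unmarried, White, Female, 0, 0, 38, United-States, <=50K",
    "",
]
records = parse_adult(sample)
shares = label_shares(records)
```

With the real file, `parse_adult(open(path))` would take the place of the sample list, where `path` is wherever the download was saved.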

Write up ways of working

See documents in the Turing Way project for ideas.

(Further notes on this issue coming shortly)

Apply for ethics approval

We should do this as soon as possible and certainly before we receive any data.

Apply on Turing Complete

Make a project logo

Review existing reviews

Let's pick these off one by one to review and evaluate / summarise. When you take one, make a separate ticket for it and add your review notes to this new issue if short or a separate linked markdown file if longer.

Existing toolboxes / code

Existing reviews / guides

Government Statistical Service review of Privacy and data confidentiality methods: [original | zotero] - Martin
ONS Data Science Campus evaluation of VAEs, GANs and SMOTE - [original | zotero] Louise
- ONS Data Science Campus detailed report on GANs - [original | zotero] Louise
UK Data Service Statistical Disclosure Control Handbook - Oliver
ONS Synthetic data pilot - Kasra

Journals / Conference proceedings

Privacy in Statistical Databases (Springer Nature)
Journal of Privacy and Confidentiality
Transactions in Data Privacy

Add Camila to the Zotero group

@crangelsmith could you let @martintoreilly know your username please?

Particle-physics-based synthetic data

@crangelsmith

synthetic data for blind analysis
respecting e.g. conservation laws

MSR GPS Privacy data

Open GPS data for 21 individuals.

See https://msropendata.com/datasets/94d31431-0842-447c-b990-245761b7c5f2

Microsoft researcher John Krumm and his collaborators collected GPS data from 21 people who carried a GPS receiver in the Seattle, Washington area. Users who provided the data agreed for the data to be shared as long as certain geographic regions were deleted.

This covers key research on privacy preservation of GPS data as evidenced in the corresponding paper "Exploring End User Preferences for Location Obfuscation, Location-Based Services, and the Value of Location", Twelfth ACM International Conference on Ubiquitous Computing (UbiComp 2010), September 26-30, 2010.

The paper has been cited dozens of times, including for research that builds upon this important work to further the field of preservation of geo-privacy for location-based services providers.

Write up notes from REG retrospective meeting

Write up post-it notes from retrospective in the HackMD.
Move HackMD notes into GitHub.
Make issues for any actions we haven't already dealt with.

Investigate privacy/de-anonymisation attacks

Set this up in the pipeline

OpenHumans data

Could be an interesting source of individual-level data, but not many participants so far.

https://www.openhumans.org/

Initial evaluation pipeline with "trivial" synthesis techniques and privacy / utility preservation measures

Project web page

Use form at: https://turingcomplete.topdesk.net/tas/public/ssp/content/serviceflow?unid=48da6fe7c46b4c7393a4cfb350b4a083&openedFromService=true

Project title

Quantifying Utility and Privacy Preservation in Synthetic Data

Project leaders

Martin O'Reilly (The Alan Turing Institute)
Alison Heppenstall (University of Leeds)
Nik Lomax (University of Leeds)
Nick Malleson (University of Leeds)
Sebastian Vollmer (The Alan Turing Institute, University of Warwick)

Details

Project page main contact

Contact name: Martin O'Reilly
Contact email address: <martin's email>
Project start date: 01 Oct 2019
Project end date: 31 Mar 2020

Sub-heading

1 sentence summary/sub-heading
(1 sentence, present tense, e.g. Using…, Developing…, Investigating…) *

Brief description

(Clear, concise, ~3 sentences – e.g. 1st sentence: the problem being addressed, 2nd sentence: the potential solution/method, 3rd sentence: applications, output) *

Aims/expected outcomes

(What is the work hoping to achieve? What would define success? Why is this work worth doing?) 100-300 words *

Explaining the science

(Is there theory or methods that would be good to explain to understand the project’s work better? Use plain English where possible) 100-300 words *

Real world applications

(Where is this work being applied, what area/industry could it benefit?) 100-300 words *

Recent updates

(Achievements/project milestones reached since project started, with month/year)

Sensitivities

Are there likely to be any sensitivities within or around this project? (For example it deals with highly sensitive subject matter such as abuse, violence, grief, etc):

Yes, there are sensitivities / No, there aren't any sensitivities

Participating researchers

Oliver Strickson (The Alan Turing Institute)
Louise Bowler (The Alan Turing Institute)
Kasra Hosseini (The Alan Turing Institute)
Greg Mingas (The Alan Turing Institute)
Camila Rangel-Smith (The Alan Turing Institute)

Collaborating organisations/universities

(Please include their roles as part of the project, e.g. funder, collaborator, data supplier etc)
University of Leeds (Collaborator)
University of Warwick (Collaborator)
UKRI (Funder)

Tagging

If this project is part of a programme(s) please tick below:

Is this project funded by the Strategic Priorities Fund?

Yes

Research areas (required)

(Please tick the research areas that are most applicable, up to max 10)

Algorithms

Applied Mathematics

Artificial Intelligence

Computer Systems & Architecture

Machine Learning

Mathematical Modelling

Optimisation

Convex Programming
Nonlinear Programming
Stochastic Optimisation

Privacy & Trust

Cryptography
Differential Privacy
Identity Management
Verification

Programming Languages

Social Data Science

Statistical Methods & Theory

Theoretical Mathematics

Data anonymisation debate - Manchester

Circulated on the "Secure Data Access Professionals" email list by Mark Elliot.

DEBATE PROPOSITION: "DATA CAN EITHER BE USEFUL OR ANONYMISED BUT NEVER BOTH."
12 NOVEMBER 2019
Time: 14:00 - 16:00
Venue: Room 3.009 Alliance Manchester Business School, Booth Street West, Manchester, M5 6PB
