
quipp-collab's Introduction

QUIPP-collab

Collaboration and project management of the QUIPP project.

Code and documentation are in the QUIPP-pipeline repo.

Admin

  • Our project code is R-SPET-202 (for Turing-internal purposes)
  • Larger or more sensitive project documents are in the project SharePoint / Teams site (access for core team via their @turing.ac.uk accounts)

Ways of working

See our ways-of-working.md document, which covers:

  • Project team
  • Communication
  • Project management with GitHub
  • Zotero

quipp-collab's People

Contributors

geoalison, gmingas, hackmd-deploy, kasra-hosseini, louiseabowler, martintoreilly, oscartgiles, ots22

Forkers

oscartgiles

quipp-collab's Issues

Create Zotero group

Create group in Zotero to collect relevant papers.

MIMIC intensive care dataset

Summary from the website

MIMIC is an openly available dataset developed by the MIT Lab for Computational Physiology, comprising deidentified health data associated with ~60,000 intensive care unit admissions. It includes demographics, vital signs, laboratory tests, medications, and more.

Data table descriptions

Tax data

Swedish individual incomes are released in a paper book for each region (the Taxeringskalender), with information provided by the Swedish tax office. Swedish citizens can also query an individual's income online, but a copy of the query is received by the query subject.

The Swedish tax office is keen on improving access to and use of open data and has a contact form for collaborations.

Once we have a web presence (and maybe our first blog post) we should contact them to see if they are interested in collaborating to generate a synthetic income data set. Although they make this data open, it's not available electronically in bulk so I think they could see a benefit to themselves in having a more open synthetic data set. They may also potentially see value in supporting work that may enable tax offices in other countries to release more detailed income data.

I also met James Dainty from HMRC Labs at the Manchester SDAP meeting in December. With the changes in data handling legislation in the Digital Economy Act, HMRC no longer requires all research to be directly improving HMRC operations, so we should also contact James to see if there is a chance of us working with them to generate synthetic UK income data.

Project web page

Use form at: https://turingcomplete.topdesk.net/tas/public/ssp/content/serviceflow?unid=48da6fe7c46b4c7393a4cfb350b4a083&openedFromService=true

Project title

Quantifying Utility and Privacy Preservation in Synthetic Data

Project leaders

Martin O'Reilly (The Alan Turing Institute)
Alison Heppenstall (University of Leeds)
Nik Lomax (University of Leeds)
Nick Malleson (University of Leeds)
Sebastian Vollmer (The Alan Turing Institute, University of Warwick)

Details

Project page main contact

  • Contact name: Martin O'Reilly
  • Contact email address: <martin's email>
  • Project start date: 01 Oct 2019
  • Project end date: 31 Mar 2020

Sub-heading

1 sentence summary/sub-heading
(1 sentence, present tense, e.g. Using…, Developing…, Investigating…) *

Brief description

(Clear, concise, ~3 sentences – e.g. 1st sentence: the problem being addressed, 2nd sentence: the potential solution/method, 3rd sentence: applications, output) *

Aims/expected outcomes

(What is the work hoping to achieve? What would define success? Why is this work worth doing?) 100-300 words *

Explaining the science

(Is there theory or methods that would be good to explain to understand the project’s work better? Use plain English where possible) 100-300 words *

Real world applications

(Where is this work being applied, what area/industry could it benefit?) 100-300 words *

Recent updates

(Achievements/project milestones reached since project started, with month/year)

Sensitivities

Are there likely to be any sensitivities within or around this project? (For example it deals with highly sensitive subject matter such as abuse, violence, grief, etc):

Yes, there are sensitivities / No, there aren't any sensitivities

Participating researchers

Oliver Strickson (The Alan Turing Institute)
Louise Bowler (The Alan Turing Institute)
Kasra Hosseini (The Alan Turing Institute)
Greg Mingas (The Alan Turing Institute)
Camila Rangel-Smith (The Alan Turing Institute)

Collaborating organisations/universities

(Please include their roles as part of the project, e.g. funder, collaborator, data supplier etc)
University of Leeds (Collaborator)
University of Warwick (Collaborator)
UKRI (Funder)

Tagging

If this project is part of a programme(s) please tick below:

  • Artificial intelligence (Safety and ethics)
  • Artificial intelligence (Robotics)
  • Data science at scale
  • Data science for science
  • Data-centric engineering
  • Defence and security
  • Finance and economics
  • Health
  • Policy
  • Research Engineering
  • Urban analytics

Is this project funded by the Strategic Priorities Fund?

  • Yes

Research areas (required)

(Please tick the research areas that are most applicable, up to max 10)

Algorithms

  • Complexity
  • Compression
  • Cryptography
  • Data Structures
  • Distributed
  • Numerical

Applied Mathematics

  • Dynamical Systems & Differential Equations
  • Information Theory
  • Mathematical Physics
  • Multi-Agent Systems
  • Numerical Analysis
  • Operations Research

Artificial Intelligence

  • Control Theory
  • Evolution & Adaptation
  • Game Theory
  • Knowledge Representation
  • Multi-agent Reasoning
  • Neural Networks
  • Neuroscience
  • Nonlinear Dynamics
  • Pattern Formation
  • Robotics
  • Symbolic systems
  • Systems Theory

Computer Systems & Architecture

  • Communications
  • Computing Networks
  • Databases
  • Human Computer Interface
  • Information Retrieval
  • Neural & Evolutionary Computing
  • Operating Systems
  • Real Time Computing
  • Parallel Computing
  • Visualisation

Machine Learning

  • Applications
  • Computer Vision
  • Deep Learning
  • Natural Language Processing
  • Pattern Recognition
  • Reinforcement Learning
  • Semi-Supervised
  • Speech Recognition
  • Supervised
  • Unsupervised

Mathematical Modelling

  • Agent-based Modelling
  • Automata & Algebraic
  • Deterministic
  • Dynamic/Static
  • Graph Theory
  • Ensemble
  • Stochastic

Optimisation

  • Convex Programming
  • Nonlinear Programming
  • Stochastic Optimisation

Privacy & Trust

  • Cryptography
  • Differential Privacy
  • Identity Management
  • Verification

Programming Languages

  • Hardware Optimisation (FPGA/GPU)
  • Literate Programming
  • Probabilistic Programming
  • Software Framework Development
  • Theory of Programming Languages
  • Visualisation

Social Data Science

  • Cognitive Science
  • Data Science of Government & Politics
  • Developmental psychology
  • Ethics
  • Linguistics
  • Management Science
  • Research Methods
  • Social Media
  • Social Networks
  • Social Psychology

Statistical Methods & Theory

  • Asymptotic
  • Causality
  • Estimation Theory
  • High Dimensional Inference
  • Information Theory
  • Modelling
  • Monte Carlo Methods
  • Non-parametric & Semi-parametric Methods
  • Probability
  • Simulation
  • Spatial Analytics
  • Time Series
  • Uncertainty Quantification

Theoretical Mathematics

  • Algebra
  • Calculus & Analysis
  • Combinatorics
  • Geometry & Topology
  • Logic
  • Number Theory

Inviting collaborators to this repository

When discussing collaborating with another project with access to medical data, the PI expressed a desire to have visibility into what we are doing. We discussed the option of inviting collaborators to this repo.

NYC Taxi & Limousine Commission trip data

Trip records capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts.

Two datasets available, both covering the period 2009-2018:

Note: It looks like pick up and drop off locations are just a pseudonymised location ID and do not come with actual location data

Review existing reviews

Let's pick these off one by one to review and evaluate / summarise. When you take one, make a separate ticket for it and add your review notes to this new issue if short or a separate linked markdown file if longer.

Existing toolboxes / code

Existing reviews / guides

Journals / Conference proceedings

  • Privacy in Statistical Databases (Springer Nature)
  • Journal of Privacy and Confidentiality
  • Transactions in Data Privacy

Data science for mental health seminar (30th Jan)

The talk on Thursday 19th December looks relevant: Towards Shareable Data in Clinical Natural Language Processing: Generating Synthetic Electronic Health Records, Julia Ive.

The seminar starts at 3:15pm, and they're usually in Ada-Augusta. Full timings and other details are available here.

Change the name of the project

Do we want to change the name of the project? It's currently:

Quantifying UncertaInty and Preserving Privacy in Synthetic Data Sets (QUiPP)

I feel that the Privacy Preservation part should be more upfront!

Any suggestions for a new title and acronym before Martin launches the project next week?!

Outcome

Final decision on project name: QUiPP: Quantifying Utility and Preserving Privacy in Synthetic Data Sets

Patient ICU data used by AIDA

  • find out if we can use it?
    (- is this the same as Sebastian's?) (No)

This sounds like a good example to start with a cleaned subset, because it is already well understood (by the AIDA team).

MSR GPS Privacy data

Open GPS data for 21 individuals.

See https://msropendata.com/datasets/94d31431-0842-447c-b990-245761b7c5f2

Microsoft researcher John Krumm and his collaborators collected GPS data from 21 people who carried a GPS receiver in the Seattle, Washington area. Users who provided the data agreed for the data to be shared as long as certain geographic regions were deleted.

This covers key research on privacy preservation of GPS data as evidenced in the corresponding paper "Exploring End User Preferences for Location Obfuscation, Location-Based Services, and the Value of Location", Twelfth ACM International Conference on Ubiquitous Computing (UbiComp 2010), September 26-30, 2010.

The paper has been cited dozens of times, including for research that builds upon this important work to further the field of preservation of geo-privacy for location-based services providers.

Set up CI on the pipeline repo

The (minimal so far) report uses GitHub Actions for building the pdf. I would like to suggest we give this a try for our CI as well. Does anyone have any thoughts on this? I haven’t used it other than just now, so would be good to hear of any pitfalls. It's free to use for public repositories.

NLP reading group on privacy-preserving data sharing (12th Dec)

Turing Slack, "#nlpreading"

nicole_peinelt 5:11 PM
Dear all,
Here are the details for next week’s reading group:
Paper: Privacy-Preserving Generative Deep Neural Networks Support Clinical Data Sharing (https://www.ahajournals.org/doi/10.1161/CIRCOUTCOMES.118.005122)
Facilitators: Julia Ive and Zixu Wang
Date: Thursday, 21st November 2019
Time: 4-5pm
Location: The Alan Turing Institute, David Blackwell Meeting Room
We kindly ask external participants to let us know a day in advance if you’re planning to join as we need to notify reception ahead of time.
Please also remember to suggest papers and volunteer as presenter in our google doc schedule: https://docs.google.com/document/d/1aWv5JEi-0ehIOFjU0YiI8K1JHcNMYZcHRtfSV0mMYdQ/edit.
Best,
Nicole

CTGAN pipeline for modelling tabular data (continuous and discrete columns)

Relevant issues:
First attempt at a Deep Learning pipeline #23


References for CTGAN
GitHub repo: https://github.com/DAI-Lab/CTGAN
Paper: https://arxiv.org/abs/1907.00503

Abstract (above paper):
Modeling the probability distribution of rows in tabular data and generating realistic synthetic data is a non-trivial task. Tabular data usually contains a mix of discrete and continuous columns. Continuous columns may have multiple modes whereas discrete columns are sometimes imbalanced making the modeling difficult. Existing statistical and deep neural network models fail to properly model this type of data. We design TGAN, which uses a conditional generative adversarial network to address these challenges. To aid in a fair and thorough comparison, we design a benchmark with 7 simulated and 8 real datasets and several Bayesian network baselines. TGAN outperforms Bayesian methods on most of the real datasets whereas other deep learning methods could not.
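The abstract notes that imbalanced discrete columns make modelling difficult but does not say how the conditional GAN handles them. As a rough illustration only (not the CTGAN library's code), the sketch below shows one idea described in the CTGAN paper: choosing which category to condition on with probability proportional to the logarithm of its frequency, so that rare categories are seen more often during training than under plain row sampling. The function name `log_frequency_pmf`, the `+ 1` inside the logarithm, and the example column are assumptions made for this sketch.

```python
import math
from collections import Counter


def log_frequency_pmf(values):
    """Return {category: probability} with mass proportional to log(count + 1)."""
    counts = Counter(values)
    # Assumption: the +1 keeps a category that occurs once from getting zero mass.
    weights = {cat: math.log(n + 1) for cat, n in counts.items()}
    total = sum(weights.values())
    return {cat: w / total for cat, w in weights.items()}


# Hypothetical imbalanced column: 99 "no" rows and 1 "yes" row.
column = ["no"] * 99 + ["yes"]
pmf = log_frequency_pmf(column)

# Under uniform row sampling "yes" would be drawn 1% of the time;
# under the log-frequency PMF it is drawn noticeably more often.
print(pmf)
```

The majority category still gets more mass than the minority one, so the resampling softens the imbalance rather than removing it.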

Write down the project roadmap

A common theme in the "What was lacking" section of our recent REG retrospective (notes available in meeting-notes/2019-12-05-reg-retrospective.md) was the lack of a roadmap and the medium/long-term structure of the project.

Several team members have joined only in the last month, so won't have been around for our initial discussions on the scope of the project. We also don't have any proposed timelines etc. in this repository - if anyone has one available, could they please add it?

January would be a good time to assess our goals going forwards. We suggested during the retrospective that one of our regular meetings would be a good time to discuss this further.

Data anonymisation debate - Manchester

Circulated on the "Secure Data Access Professionals" email list by Mark Elliot.

DEBATE PROPOSITION: "DATA CAN EITHER BE USEFUL OR ANONYMISED BUT NEVER BOTH."
12 NOVEMBER 2019
Time: 14:00 - 16:00
Venue: Room 3.009, Alliance Manchester Business School, Booth Street West, Manchester, M15 6PB
