ryanwangzf / influence_subsampling

Official Implementation of Unweighted Data Subsampling via Influence Function - AAAI 2020

License: MIT License



Unweighted Influence Data Subsampling (UIDS)


This repository provides a NumPy- and SciPy-based implementation of Unweighted Influence Data Subsampling (UIDS) for the logistic regression model. By dropping several bad cases, UIDS can achieve good results when the data set quality is poor, e.g., when labels are noisy or when there is a distribution shift between the training and test sets.

Paper & Citation

Less Is Better: Unweighted Data Subsampling via Influence Function

If you find this work interesting or helpful for your research, please consider citing this paper and giving the repo a star ^ ^

@inproceedings{wang2019less,
  title={Less Is Better: Unweighted Data Subsampling via Influence Function},
  author={Wang, Zifeng and Zhu, Hong and Dong, Zhenhua and He, Xiuqiang and Huang, Shao-Lun},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence (AAAI)},
  year={2020}
}

Introduction

In practice, it is common that some of the collected data are unhelpful or even harmful for our model. Moreover, the data distribution is sometimes not stable: there often is a distribution shift between the training set and the test set, which degrades the performance of classical supervised machine learning algorithms.

Intuition Demonstration

Subsampling aims to build a tool that quantifies each sample's quality, keeping good examples and dropping bad ones to improve the model's generalization ability. Previous works concentrate on weighted subsampling, i.e., trying to maintain the model's performance while dropping some of the data.

By contrast, our work attempts to obtain a superior model by subsampling.

The difference between them is shown in the image below:



  • (a) if the blue points (training samples) within the red circle are removed, the new optimal decision boundary remains the same as the former one
  • (b) if the blue points within the red circle are removed, the new decision boundary shifts away from the former one, yet achieves better performance on the test set

Main Framework


The main process of subsampling is as follows:

  • (a) first, train a model on the full data set

  • (b) compute the influence function (IF) of each sample in the training set

  • (c) compute the sampling probability of each sample in the training set

  • (d) do the subsampling and train a subset model on the reduced data set
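As a rough illustration, steps (a)-(c) can be sketched for logistic regression with plain NumPy/SciPy. The function names and the exact sigmoid form of the sampling probability below are our own illustrative choices, not the paper's estimator:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def fit_logreg(X, y, lam=1e-2):
    """(a) Train L2-regularized logistic regression (labels in {0, 1})."""
    def loss(w):
        z = X @ w
        # log(1 + exp(z)) - y*z is the logistic loss, written stably
        return np.mean(np.logaddexp(0, z) - y * z) + 0.5 * lam * w @ w
    def grad(w):
        return X.T @ (expit(X @ w) - y) / len(y) + lam * w
    return minimize(loss, np.zeros(X.shape[1]), jac=grad, method="L-BFGS-B").x

def sampling_probs(X_tr, y_tr, X_va, y_va, lam=1e-2):
    """(b)+(c): per-sample influence on the validation loss, turned into
    a sigmoid-style keep-probability (an assumed form for illustration)."""
    w = fit_logreg(X_tr, y_tr, lam)
    p_tr = expit(X_tr @ w)
    grad_tr = X_tr * (p_tr - y_tr)[:, None]                 # one gradient row per sample
    grad_va = X_va.T @ (expit(X_va @ w) - y_va) / len(y_va)  # mean validation gradient
    d = p_tr * (1 - p_tr)
    H = (X_tr * d[:, None]).T @ X_tr / len(y_tr) + lam * np.eye(len(w))
    # influence of up-weighting each training point on the validation loss
    phi = -grad_tr @ np.linalg.solve(H, grad_va)
    # harmful points (phi > 0: up-weighting raises val loss) get low keep-probability
    return expit(-phi / (np.abs(phi).std() + 1e-12))
```

Step (d) is then a Bernoulli draw per sample, e.g. `keep = np.random.rand(len(probs)) < probs`, followed by retraining on `X_tr[keep]`.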


Other Interesting Stuff

To accelerate the computation of the influence function, we modify the original scipy/optimize module to implement the Hessian-free Preconditioned Truncated Newton method [Hsia et al., 2018] for logistic regression.

See ./optimize/optimize.py for the details.
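The core idea behind the Hessian-free approach can be sketched as follows: truncated Newton (and the H⁻¹v solve needed by the influence function) only requires Hessian-vector products, which for logistic regression never require forming H explicitly. This is a minimal unpreconditioned sketch of that idea, not the modified scipy code itself:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg
from scipy.special import expit

def solve_hvp(X, w, b, lam=1e-2):
    """Solve H s = b for logistic regression without forming H.

    Each conjugate-gradient step only needs the Hessian-vector product
    H v = X^T (d * (X v)) / n + lam * v, with d = p * (1 - p) and
    p = sigmoid(X w). The preconditioning of [Hsia et al., 2018]
    is omitted here for brevity."""
    n, m = X.shape
    p = expit(X @ w)
    d = p * (1 - p)
    def hvp(v):
        return X.T @ (d * (X @ v)) / n + lam * v
    H = LinearOperator((m, m), matvec=hvp)
    s, info = cg(H, b, maxiter=200)
    return s
```

Since only `X @ v` and `X.T @ u` are needed, `X` can be a `scipy.sparse` matrix, which is what makes very high-dimensional data sets tractable.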

Usage & Demo

For a simple demo on MNIST and Breast-cancer

We have prepared simple demos for Logistic Regression and SVM; see Demo_on_Logistic_Regression.ipynb and Demo_on_SVM.ipynb.

The experiment results are shown below. Sig-UIDS obtains much better ACC and AUC than both the full-set model and the subset model obtained by random sampling.

============================================================
MNIST: Result Summary on Te (ACC and AUC)
[SigUIDS]  acc 0.984281, auc 0.998802 # 4994
[Random]   acc 0.900139, auc 0.950359 # 4995
[Full]     acc 0.916782, auc 0.961824 # 8325
============================================================

For other data sets

For other data sets, we provide a simple tool that processes the raw text data into a scipy.sparse matrix; it supports fairly large, high-dimensional data sets in practice (more than 10 million features):

=================================================================
python -u process_data.py -p 2 -b 10 -n 1000 -f fm data/XXX.txt
Args:
-p: # of threads used in processing
-b: # of lines processed per thread
-n: the maximum # of features in the raw data set
-f: either "fm" or "ffm", the format of the raw text data; "fm" stores one sample per line as "feature_id:value" pairs, while "ffm" uses "field_id:feature_id:value"
=================================================================

You can find a toy ffm-formatted data set at ./data/toy.ffm.
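For illustration, a minimal single-threaded reader for the "fm" format might look like the sketch below. The assumed line layout (a label followed by space-separated "feature_id:value" pairs) is ours; process_data.py is the authoritative parser and adds multi-threading and "ffm" field ids on top of the same idea:

```python
import numpy as np
from scipy import sparse

def load_fm(path, n_features):
    """Read an "fm"-style text file into a CSR matrix and a label vector.

    Assumed layout per line: "<label> <feature_id>:<value> ..."."""
    rows, cols, vals, labels = [], [], [], []
    with open(path) as f:
        for i, line in enumerate(f):
            parts = line.split()
            labels.append(float(parts[0]))
            for tok in parts[1:]:
                fid, val = tok.split(":")
                rows.append(i)          # sample index
                cols.append(int(fid))   # feature index
                vals.append(float(val))
    X = sparse.csr_matrix((vals, (rows, cols)),
                          shape=(len(labels), n_features))
    return X, np.array(labels)
```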

Then you can refer to the demo notebooks to run your own experiments on other data sets ^ ^

Acknowledgement

We especially thank Professor Chih-Jen Lin for his insights and advice on the theory and writing of this work.
