MISSION: Ultra Large-Scale Feature Selection using Count-Sketches

License: Apache License 2.0

dna-metagenomics count-sketches large-scale-learning feature-extraction hashing compressive-sensing

An ICML 2018 paper by Amirali Aghazadeh*, Ryan Spring*, Daniel LeJeune, Gautam Dasarathy, Anshumali Shrivastava, Richard G. Baraniuk

* These authors contributed equally and are listed alphabetically.

How-To-Run + Code Versions

  1. Build executables by running Makefile
  2. Mission Logistic Regression
// Hyperparameters
// Size of Top-K Heap
const size_t TOPK = (1 << 14) - 1;

// Size of Count-Sketch Array
const size_t D = (1 << 18) - 1;

// Number of Arrays in Count-Sketch
const size_t N = 3;

// Learning Rate
const float LR = 5e-1;

./mission_logistic train_data test_data
  3. Fine-Grained Mission Softmax Regression
// Hyperparameters

// Size of Top-K Heap
const size_t TOPK = (1 << 20) - 1;

// Number of Classes
const size_t K = 193;

// Size of Count-Sketch Array
const size_t D = (1 << 24) - 1;

// Number of Arrays in Count-Sketch
const size_t N = 3;

// Learning Rate
const float LR = 1e-2;

// Length of String Feature Representation
const size_t LEN = 12;

./fine_mission_softmax train_data test_data
  4. Coarse-Grained Mission Softmax Regression
// Hyperparameters

// Size of Top-K Heap
const size_t TOPK = (1 << 22) - 1;

// Number of Classes
const size_t K = 193;

// Size of Count-Sketch Array
const size_t D = (1 << 24) - 1;

// Number of Arrays in Count-Sketch
const size_t N = 3;

// Learning Rate
const float LR = 1e-1;

// Length of String Feature Representation
const size_t LEN = 12;

./coarse_mission_softmax [train_data_part_1 train_data_part_2 ... train_data_part_n] test_data
  5. Feature Hashing Softmax Regression
// Hyperparameters

// Number of Classes
const size_t K = 193;

// Size of Count-Sketch Array
const size_t D = (1 << 24) - 1;

// Learning Rate
const float LR = 1e-2;

// Length of String Feature Representation
const size_t LEN = 12;

./softmax [train_data_part_1 train_data_part_2 ... train_data_part_n] test_data
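
The hyperparameters above size a count-sketch with N arrays of D counters each. As a rough illustration of what those two numbers control, here is a minimal sketch, not the repository's implementation: the class name `CountSketch`, the hash mixing function, and the median query are assumptions made for this example (the repository builds its own MurmurHash).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative count-sketch: n_arrays rows of array_size counters each.
// The mixing function is a stand-in chosen for brevity, NOT the hash MISSION uses.
class CountSketch {
public:
    CountSketch(std::size_t n_arrays, std::size_t array_size)
        : n_(n_arrays), d_(array_size), table_(n_arrays * array_size, 0.0f) {
        assert(n_ > 0 && d_ > 0);
    }

    // Add `delta` (for example a scaled gradient) to every counter `key` maps to.
    void update(std::uint64_t key, float delta) {
        for (std::size_t i = 0; i < n_; ++i) {
            const std::uint64_t h = mix(key, i);
            table_[i * d_ + bucket(h)] += sign(h) * delta;
        }
    }

    // Estimate the accumulated value for `key` as the median over the rows.
    float query(std::uint64_t key) const {
        std::vector<float> est(n_);
        for (std::size_t i = 0; i < n_; ++i) {
            const std::uint64_t h = mix(key, i);
            est[i] = sign(h) * table_[i * d_ + bucket(h)];
        }
        const auto mid = est.begin() + static_cast<std::ptrdiff_t>(n_ / 2);
        std::nth_element(est.begin(), mid, est.end());
        return *mid;
    }

private:
    // 64-bit finalizer-style mixer; `row` selects a different hash per array.
    static std::uint64_t mix(std::uint64_t key, std::uint64_t row) {
        std::uint64_t z = key + 0x9E3779B97F4A7C15ULL * (row + 1);
        z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
        z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
        return z ^ (z >> 31);
    }
    std::size_t bucket(std::uint64_t h) const { return static_cast<std::size_t>(h % d_); }
    static float sign(std::uint64_t h) { return (h >> 63) ? 1.0f : -1.0f; }

    std::size_t n_;
    std::size_t d_;
    std::vector<float> table_;
};
```

With `CountSketch cs(3, (1 << 18) - 1)`, memory stays fixed at N * D floats no matter how many distinct feature ids are updated. A larger D lowers the chance that two features collide in a row, and the median over N rows damps the collisions that do occur.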

Optimizations

  • MISSION streams the dataset through memory-mapped I/O instead of loading it all into memory,
    which is necessary for tera-scale datasets.
  • AVX SIMD optimization for fast Softmax Regression
  • The code is currently optimized for the Splice-Site and DNA Metagenomics datasets.
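
To make the memory-mapped streaming idea concrete, here is a minimal POSIX sketch that walks a file one page-aligned window at a time, so only the current window is resident. It is an illustration only: the function name `count_lines_mmap` and the `window_pages` parameter are hypothetical and do not come from the repository's fast parser.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Count newline characters in `path`, mapping one window of the file at a time.
// Returns -1 on any I/O error. `window_pages` is a hypothetical tuning knob.
long long count_lines_mmap(const char* path, std::size_t window_pages = 256) {
    assert(window_pages > 0);
    const int fd = open(path, O_RDONLY);
    if (fd < 0) { std::perror("open"); return -1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { std::perror("fstat"); close(fd); return -1; }

    // mmap offsets must be multiples of the page size, so windows are whole pages.
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    const std::size_t window = page * window_pages;
    const std::size_t size = static_cast<std::size_t>(st.st_size);
    long long lines = 0;

    for (std::size_t off = 0; off < size; off += window) {
        const std::size_t len = std::min(window, size - off);
        void* addr = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd,
                          static_cast<off_t>(off));
        if (addr == MAP_FAILED) { std::perror("mmap"); close(fd); return -1; }

        const char* p = static_cast<const char*>(addr);
        lines += std::count(p, p + len, '\n');
        munmap(addr, len);  // release the window before mapping the next one
    }
    close(fd);
    return lines;
}
```

A real parser also has to handle records that straddle a window boundary, which this sketch sidesteps by counting single bytes.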

Mission Softmax Regression

  1. Fine-Grained Feature Set - Each class maintains a separate feature set, so there is a top-k heap for each class.
  2. Coarse-Grained Feature Set - All classes share a common set of features, so there is only one top-k heap.
    Each feature is ranked by its L1 norm across all classes.
  3. Data Parallelism - Each worker maintains a separate heap, while aggregating gradients in the same count-sketch.
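
The difference between the fine-grained and coarse-grained feature sets can be sketched as follows. This is a simplified illustration under stated assumptions: the weights are held in a small dense `weights[class][feature]` table and ranked with a sort, whereas the actual code keeps weights in a count-sketch and tracks the top features with heaps. The names `Weights`, `top_k`, `fine_grained`, and `coarse_grained` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical dense layout for illustration: weights[class][feature].
using Weights = std::vector<std::vector<float>>;

// Indices of the k largest scores, best first.
std::vector<std::size_t> top_k(const std::vector<float>& score, std::size_t k) {
    std::vector<std::size_t> idx(score.size());
    std::iota(idx.begin(), idx.end(), std::size_t{0});
    k = std::min(k, idx.size());
    std::partial_sort(idx.begin(), idx.begin() + static_cast<std::ptrdiff_t>(k), idx.end(),
                      [&](std::size_t a, std::size_t b) { return score[a] > score[b]; });
    idx.resize(k);
    return idx;
}

// Fine-grained: one feature set per class, ranked by |w[class][feature]|.
std::vector<std::vector<std::size_t>> fine_grained(const Weights& w, std::size_t k) {
    std::vector<std::vector<std::size_t>> sets;
    for (const auto& row : w) {
        std::vector<float> score(row.size());
        std::transform(row.begin(), row.end(), score.begin(),
                       [](float v) { return std::fabs(v); });
        sets.push_back(top_k(score, k));
    }
    return sets;
}

// Coarse-grained: one shared feature set, ranked by the L1 norm across classes.
std::vector<std::size_t> coarse_grained(const Weights& w, std::size_t k) {
    if (w.empty()) return {};
    std::vector<float> score(w.front().size(), 0.0f);
    for (const auto& row : w) {
        assert(row.size() == score.size());  // every class covers the same features
        for (std::size_t f = 0; f < row.size(); ++f) score[f] += std::fabs(row[f]);
    }
    return top_k(score, k);
}
```

The fine-grained variant stores K separate sets, so a feature that matters for a single class can survive, while the coarse-grained variant spends one shared budget on features that carry weight summed over all classes.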

Datasets

  1. KDD 2012
  2. RCV1
  3. Webspam - Trigram
  4. DNA Metagenomics
  5. Criteo 1TB
  6. Splice-Site 3.2TB

mission's Issues

How do you read data from the Criteo terabyte click logs?

Great code, thanks.

  1. Will it work on a local Windows computer?
  2. Could you clarify the statement "All data files are formatted using the VW input format"? The Datasets section (KDD 2012, RCV1, Webspam - Trigram, DNA Metagenomics, Criteo 1TB, Splice-Site 3.2TB) refers to the Criteo data in libsvm format. Could you share how you read the data from the Criteo terabyte click logs?
  3. Do you know of other code examples that run the Criteo 1TB benchmark fully locally, without Spark, as a kind of online learning? The dataset is at
     https://labs.criteo.com/2013/12/download-terabyte-click-logs-2/
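
For readers comparing the two formats mentioned in this issue: a libsvm line is `label index:value ...`, and the simplest Vowpal Wabbit line puts a `|` between the label and the features. The helper below is a hedged sketch of that conversion. The function name `libsvm_to_vw` is hypothetical, and whether MISSION's parser accepts exactly this output, or expects particular label values, is an assumption to verify against the repository.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Convert one libsvm line ("label idx:val idx:val ...") into a VW-style line
// ("label | idx:val idx:val ..."). Returns an empty string for a blank line.
// The label is copied unchanged; remap it first if the target loss expects -1/1.
std::string libsvm_to_vw(const std::string& line) {
    assert(line.find('\n') == std::string::npos);  // expects a single line

    const std::size_t first = line.find_first_not_of(" \t");
    if (first == std::string::npos) return "";

    const std::size_t label_end = line.find_first_of(" \t", first);
    std::string out = line.substr(first, label_end - first) + " |";

    if (label_end != std::string::npos) {
        const std::size_t feat = line.find_first_not_of(" \t", label_end);
        if (feat != std::string::npos) out += " " + line.substr(feat);
    }
    return out;
}
```

For example, `libsvm_to_vw("1 3:0.5 10:1")` yields `1 | 3:0.5 10:1`.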

MMAP issue when running criteo_tb dataset

I am running the 2.7TB criteo_tb dataset with TOPK set to 250000 and D set to (1 << 20) - 1. When I run the program, it fails with "MMAP Failure" in the fast parser. My understanding is that the fast parser should only load one page at a time, so why does this happen?

I printed errno: it is 9, which corresponds to "Bad file number" (EBADF).
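
One possible cause, offered as a guess rather than a diagnosis: `mmap` reports EBADF when the file descriptor it receives is not a valid open descriptor, for instance when an earlier `open` failed and returned -1. A small check like the one below, where the name `open_for_mmap` is hypothetical, would surface that case before the mapping is attempted.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <fcntl.h>
#include <unistd.h>

// Open `path` read-only and report the reason on failure, instead of silently
// handing an invalid descriptor (-1) to mmap. Returns the descriptor or -1.
int open_for_mmap(const char* path) {
    assert(path != nullptr);
    const int fd = open(path, O_RDONLY);
    if (fd < 0) {
        std::fprintf(stderr, "open(%s) failed: %s\n", path, std::strerror(errno));
    }
    return fd;
}
```

If the descriptor is valid, the cause lies elsewhere, for example a descriptor that was closed before a later `mmap` call.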

Getting error "error: use of undeclared identifier 'aligned_alloc'" during make command

Hi,

I am trying to run the MISSION code on macOS.
When I build the executables with the Makefile, I get this error:
MISSION/src/include/cms.h:58:25: error: use of undeclared identifier 'aligned_alloc'
data = (float*) aligned_alloc(32, sizeof(float)*SIZE);

I am using this CMakeLists.txt:
cmake_minimum_required(VERSION 3.15)
project(mission)
set(CMAKE_CXX_STANDARD 11)
include_directories("/usr/local/include")
include_directories("/usr/local/opt/llvm/include")
link_directories("/usr/local/lib" "/usr/local/opt/llvm/lib")
#add_executable(mission /Users/neerajsharma/my_work/umass/umass_study/1st_sem/CS689/MISSION/src/mission_logistic.cpp)
add_executable(mission /Users/neerajsharma/my_work/umass/umass_study/1st_sem/CS689/MISSION/src/mission_logistic.cpp /Users/neerajsharma/my_work/umass/umass_study/1st_sem/CS689/MISSION/src/fine_mission_softmax.cpp)
include_directories("/Users/neerajsharma/my_work/umass/umass_study/1st_sem/CS689/MISSION/src/include")

I am using this cmake command:
cmake -D CMAKE_C_COMPILER=/usr/local/bin/gcc-9 -D CMAKE_CXX_COMPILER=/usr/bin/g++ .

@rdspring1 Can you please help me in solving this issue?
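
A possible workaround, sketched under assumptions: `aligned_alloc` comes from C11 and C++17, so a C++11 build against an older macOS SDK may not declare it. `posix_memalign` is available on both macOS and Linux and can provide the same 32-byte alignment. The helper name `alloc_aligned_floats` is hypothetical and is not part of the repository.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <stdlib.h>  // posix_memalign (POSIX)

// Allocate `count` floats aligned to `alignment` bytes (32 suits AVX loads).
// Returns nullptr on failure. Release the memory with std::free.
float* alloc_aligned_floats(std::size_t count, std::size_t alignment = 32) {
    // posix_memalign needs a power-of-two alignment that is a multiple of sizeof(void*).
    assert(alignment >= sizeof(void*) && (alignment & (alignment - 1)) == 0);

    void* ptr = nullptr;
    if (posix_memalign(&ptr, alignment, sizeof(float) * count) != 0) {
        return nullptr;
    }
    return static_cast<float*>(ptr);
}
```

In the failing line, `data = (float*) aligned_alloc(32, sizeof(float)*SIZE);` could then become `data = alloc_aligned_floats(SIZE);`, with the existing `free` call left unchanged.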

error: use of undeclared identifier 'MAP_POPULATE'

I am interested in your paper, but I ran into a problem when trying to run this code. When I run make all, I get the following output:

rm -rf MurmurHash.o
rm -rf fast_parser.o
rm -rf util.o
rm -rf mission_logistic
rm -rf fine_mission_softmax
rm -rf coarse_mission_softmax
rm -rf softmax
g++ -Wall --std=c++14 -O3 -Iinclude/ -o fast_parser.o -c fast_parser.cpp
fast_parser.cpp:31:66: error: use of undeclared identifier 'MAP_POPULATE'
                addr = mmap(NULL, pg_size, PROT_READ, MAP_FILE | MAP_PRIVATE | MAP_POPULATE, fd, offset);
                                                                               ^
1 error generated.
make: *** [fast_parser] Error 1

How can I fix this? I expected MAP_POPULATE to be defined in mman.h.
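
`MAP_POPULATE` is a Linux-specific flag that pre-faults the mapped pages, so headers on other systems such as macOS do not define it. A common workaround, shown here as a sketch rather than an official fix, is to define it as 0 where it is missing so that it becomes a no-op in the flag expression. The helper `mission_mmap_flags` is hypothetical and only exists to show the resulting flags.

```cpp
#include <sys/mman.h>

// MAP_POPULATE only exists on Linux. Defining it as 0 elsewhere keeps the
// original mmap call compiling, at the cost of losing the pre-faulting hint.
#ifndef MAP_POPULATE
#define MAP_POPULATE 0
#endif

// Same flag combination as the failing line in fast_parser.cpp.
constexpr int mission_mmap_flags() {
    return MAP_FILE | MAP_PRIVATE | MAP_POPULATE;
}
```

Placing the `#ifndef` block near the top of fast_parser.cpp, after the `<sys/mman.h>` include, should let the existing `mmap` call compile unchanged. Pages are then faulted in lazily on first access.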

The loss doesn't decrease on the KDD12 dataset

Hi, I ran the code on KDD12, but the loss did not decrease even after a full pass over the data. I also tried increasing D to 1 << 26 - 1 (the paper mentions "The size of the MISSION and FH models were set to the nearest power of 2 greater than the number of features in the dataset"), but that did not help.
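
A side note on the expression for D, in case it was typed into the source exactly as written in this issue: in C++ the binary `-` operator binds more tightly than `<<`, so the parentheses used in the README's hyperparameters matter. The constant names below are hypothetical and serve only to show the two readings.

```cpp
#include <cstddef>

// Without parentheses the subtraction happens first: 1 << (26 - 1) = 2^25.
constexpr std::size_t D_unparenthesized = 1 << 26 - 1;

// With parentheses, as in the README's hyperparameters: 2^26 - 1.
constexpr std::size_t D_parenthesized = (1 << 26) - 1;

static_assert(D_unparenthesized == 33554432, "parsed as 1 << 25");
static_assert(D_parenthesized == 67108863, "2^26 - 1");
```

Compilers typically flag the first form with a parentheses warning under `-Wall`, which is a quick way to check whether a build is affected.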

Any suggestion will be greatly appreciated!
