distributed_ranksvm's Introduction

This directory includes sources used in the following thesis:

Wei-Lun Huang, Analysis and Implementation of Large-scale
Linear RankSVM in Distributed Environments, 2015.

It supports distributed L2-regularized L2-loss linear rankSVM for the following 
two kinds of splits:
		
		Query-wise split (QW)
		Feature-wise split (FW)
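
As a rough illustration of what the two splits mean (this is not the code used by rk_split.py), the sketch below partitions a tiny ranking data set across machines. The instance layout, the round-robin and modulo assignment policies, and the function names are all assumptions made for this example.

```python
def query_wise_split(instances, nr_machines):
    """Keep all instances that share a query id on the same machine."""
    parts = [[] for _ in range(nr_machines)]
    machine_of_query = {}
    for label, qid, features in instances:
        if qid not in machine_of_query:
            # Hypothetical policy: round-robin over queries in order of appearance.
            machine_of_query[qid] = len(machine_of_query) % nr_machines
        parts[machine_of_query[qid]].append((label, qid, features))
    return parts


def feature_wise_split(instances, nr_machines):
    """Give every machine all instances, but only a subset of the features."""
    parts = [[] for _ in range(nr_machines)]
    for label, qid, features in instances:
        per_machine = [dict() for _ in range(nr_machines)]
        for index, value in features.items():
            # Hypothetical policy: feature index modulo the number of machines.
            per_machine[index % nr_machines][index] = value
        for m in range(nr_machines):
            parts[m].append((label, qid, per_machine[m]))
    return parts


# Each instance is (label, query id, sparse features as {index: value}).
toy = [
    (2, 1, {1: 0.5, 2: 1.0}),
    (1, 1, {2: 0.3}),
    (0, 2, {1: 0.9, 3: 0.1}),
]
qw = query_wise_split(toy, 2)
fw = feature_wise_split(toy, 2)
```

With a query-wise split each machine holds complete queries, so preference pairs never cross machines. With a feature-wise split each machine sees every instance but only part of each feature vector.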


System Requirement
==================
The experiments are meant to be run on UNIX machines. The following
commands are required:
- UNIX commands (mv, ln, cp, cat, apt-get, etc.)
- bash
- g++
- wget
- make
- Python 2.6 or a newer 2.x version (Python 3.x is not supported)
- Open MPI
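
A quick way to check the list above before starting is sketched below. It is not part of this package. The command names are taken from this README, with `mpirun` standing in for Open MPI, and the helper names are made up for this example.

```python
import os

# Commands named in the requirement list; mpirun stands in for Open MPI.
REQUIRED_COMMANDS = ["bash", "g++", "wget", "make", "mpirun"]


def find_command(name):
    """Return the full path of `name` if it is an executable on PATH, else None."""
    for directory in os.environ.get("PATH", "").split(os.pathsep):
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def missing_commands(names):
    """Return the subset of `names` that cannot be found on PATH."""
    return [name for name in names if find_command(name) is None]


if __name__ == "__main__":
    missing = missing_commands(REQUIRED_COMMANDS)
    if missing:
        print("Missing commands: " + ", ".join(missing))
    else:
        print("All required commands were found.")
```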


Quick start
============
1.  Setup distributed environments & solvers.
    (See Setup Distributed Environments & Solvers section for details)

2.  Prepare distributed data sets.
    (See Prepare Distributed Data Sets section for details)

3.  Edit run_all.sh to set the machinefile you want to use
    and the command you want to run.

4.  Edit `nr_machine_list' in plot_speedup.py to set the
    numbers of machines whose speedup you want to compare.

5.  Run all of the experiments.

	$ ./run_all.sh


Introduction
============
We implement TreeTron-QW and TreeTron-FW, which are distributed trust-region
Newton methods (TRON) with query-wise and feature-wise splits, respectively.

For query-wise split (default), please type

$ mpirun -n # --machinefile conf/machinefile# solver/train -S 0 your_data

For feature-wise split, please type

$ mpirun -n # --machinefile conf/machinefile# solver/train -S 1 your_data

For a detailed explanation of the parameters, please type

$ mpirun -n # --machinefile conf/machinefile# solver/train

where # is the number of machines (splits) you want to use.
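
If you launch the solver from a script, the commands above can be assembled as in the following sketch. It uses only the options shown in this section (-n, --machinefile, -S). The function name and the `SPLIT_FLAGS` table are hypothetical helpers, not part of the package.

```python
# Values of -S as given above: 0 for query-wise, 1 for feature-wise.
SPLIT_FLAGS = {"QW": "0", "FW": "1"}


def build_train_command(nr_machines, split, data_path):
    """Return the mpirun argument list for the given machine count and split."""
    if split not in SPLIT_FLAGS:
        raise ValueError("split must be 'QW' or 'FW'")
    machinefile = "conf/machinefile%d" % nr_machines
    return ["mpirun", "-n", str(nr_machines),
            "--machinefile", machinefile,
            "solver/train", "-S", SPLIT_FLAGS[split], data_path]


command = build_train_command(4, "FW", "your_data")
print(" ".join(command))
```

The resulting list can be passed to `subprocess.call` to run the solver.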


Compare TreeTron-QW/TreeTron-FW/TreeTron 
========================================
Edit 'all_data' in parameter.py to indicate the data for comparison. Remove the
data sets that you are not interested in. For example, change

data = ['MSLR','YAHOO_SET1','MQ2007-list','MQ2008-list', 'MQ2007', 'MQ2008', 'AMAZON_poly2']

to

data = ['MSLR','AMAZON_poly2']

After deciding on the data sets, you must prepare them and
install the solvers as well as the tools. Please see the sections 'Prepare
Distributed Data Sets' and 'Setup Distributed Environments & Solvers' for
more details.

Type

$ python ./run_exp.py conf/machinefile#

where # is the number of machines (default 2) you would like to use.

Type

$ python ./plot_qw_fw_comparison.py

to compare solvers. The results are stored in the 'figure/' directory.

Plot speedup with respect to number of machines
================================================
Type

$ python ./run_exp.py conf/machinefile#

where # is the number of machines
(1, 2, 4, 8, 16 in the default setting).

Type

$ python ./plot_speedup.py QW

for query-wise split, or

$ python ./plot_speedup.py FW

for feature-wise split to show the speedup. 
The results are stored in the 'figure_speedup/' directory.
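
For reference, speedup is commonly defined as the running time on the baseline number of machines divided by the running time on n machines. The sketch below computes it under that assumption. Whether plot_speedup.py uses exactly this definition is not stated here, and the timings are made-up numbers, not experimental results.

```python
def speedup(times_by_machines, baseline=1):
    """Map each machine count to baseline_time / time (assumed definition)."""
    base = times_by_machines[baseline]
    return dict((n, base / float(t)) for n, t in times_by_machines.items())


# Hypothetical training times in seconds, keyed by number of machines.
timings = {1: 120.0, 2: 64.0, 4: 40.0}
result = speedup(timings)
```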


Prepare Distributed Data Sets 
=============================
This section is modified from the README in the experiment code of the following paper: 
Ching-Pei Lee and Chih-Jen Lin, Large-scale Linear RankSVM, 2013.

Please download those data sets you are interested in from the following sites
and put them in the directory './data/'.
You do not need to extract the zip/rar/tgz files.
After all downloads are finished, type

$ python ./gen_data.py

The script will extract all data sets and conduct pre-processing tasks.
You can also edit 'gen_data.py' to comment out the data sets you are not interested in. For example:
#data_dict['YAHOO_SET1'] = 'YAHOO/'

For LETOR data sets:
Download the rar files from
http://research.microsoft.com/en-us/um/beijing/projects/letor/LETOR4.0/Data/MQ2007.rar
http://research.microsoft.com/en-us/um/beijing/projects/letor/LETOR4.0/Data/MQ2008.rar
http://research.microsoft.com/en-us/um/beijing/projects/letor/LETOR4.0/Data/MQ2007-list.rar
http://research.microsoft.com/en-us/um/beijing/projects/letor/LETOR4.0/Data/MQ2008-list.rar
The four urls are for MQ2007, MQ2008, MQ2007-list and MQ2008-list,
respectively.

For MSLR data set:
Download the zip file from
http://research.microsoft.com/en-us/um/beijing/projects/mslr/data/MSLR-WEB30K.zip

For Yahoo Learning to Rank Challenge data sets:
Download the tgz file from the following page
http://webscope.sandbox.yahoo.com/catalog.php?datatype=c
by selecting the data set "C14 - Yahoo! Learning to Rank Challenge (421 MB)".

For AMAZON Employee Access Challenge data sets:
Download the csv files from the following page
https://www.kaggle.com/c/amazon-employee-access-challenge/data 
and put them in the directory data/AMAZON/.
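
Before running gen_data.py, it may help to confirm that the archives are in place. The sketch below is not part of the package. The file names are taken from the download URLs above, on the assumption that the archives keep those names. The Yahoo and AMAZON files are left out because their file names are not given here.

```python
import os

# Archive names as they appear at the end of the LETOR and MSLR URLs.
EXPECTED_ARCHIVES = [
    "MQ2007.rar",
    "MQ2008.rar",
    "MQ2007-list.rar",
    "MQ2008-list.rar",
    "MSLR-WEB30K.zip",
]


def missing_archives(data_dir, names=EXPECTED_ARCHIVES):
    """Return the archive names that are not present in `data_dir`."""
    return [name for name in names
            if not os.path.isfile(os.path.join(data_dir, name))]


if __name__ == "__main__":
    for name in missing_archives("./data"):
        print("Not found in ./data/: " + name)
```

Pass a shorter `names` list if you only need some of the data sets.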

Type

$ python ./rk_split.py QW your_machinefile your_data

to split your data query-wise, or

$ python ./rk_split.py FW your_machinefile your_data

to split it feature-wise.


Setup Distributed Environments & Solvers
========================================
To start the experiments, you must set up the environments and solvers first.
Type 

$ apt-get install openmpi-bin libopenmpi-dev

on all machines for installing Open MPI.

Edit `conf/machinefile#' (# is the number of machines you want to use)
so that it has the following format (each line is a hostname or an IP address):
	
	localhost
	machine2
	machine3

	Note: Each line should represent a unique machine.
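
The uniqueness requirement can be checked with a small script such as the sketch below. It assumes the plain one-host-per-line format shown above and is not part of the package. The function name is made up for this example.

```python
def read_machinefile(lines):
    """Return the hosts listed in `lines`, rejecting duplicate entries."""
    hosts = []
    for raw in lines:
        host = raw.strip()
        if not host:
            continue  # skip blank lines
        if host in hosts:
            raise ValueError("duplicate machine: " + host)
        hosts.append(host)
    return hosts


hosts = read_machinefile(["localhost\n", "machine2\n", "machine3\n"])
```

To check a real file, pass an open file object, for example `read_machinefile(open("conf/machinefile3"))`.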

Type

$ mpirun -n # --machinefile conf/machinefile# mkdir -p `pwd`;
  util/sync_dir.py "`pwd`/solver" `pwd` conf/machinefile#

to copy the sources of solver to all machines.

Type

$ mpirun -n # --machinefile conf/machinefile# make -C solver

to compile solvers on all machines.
