quinngroup / dr1dl-pyspark
Dictionary Learning in PySpark
License: Apache License 2.0
On line 106, where u_old is defined as a random T-length vector, the pseudocode indicates the vector should be 0-mean and unit-length, but the mean is never subtracted off. Please add that operation.
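A minimal sketch of that initialization (assuming u_old is a NumPy array of length T, as in the prototype; the value of T here is a stand-in):
import numpy as np

T = 100                           # stand-in length of the vector
u_old = np.random.random(T)       # random T-length vector
u_old -= u_old.mean()             # subtract off the mean (0-mean)
u_old /= np.linalg.norm(u_old)    # rescale to unit length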
Import statements in Python files need to follow three guidelines:
For example:
import numpy as np
import numpy.testing
import numpy.linalg as sla
import argparse
is incorrect. The correct ordering would be:
import argparse
import numpy as np
import numpy.linalg as sla
import numpy.testing
Avoid the from package import function form, as this creates potential namespace collisions. Rather, use the import package.subpackage as alias syntax. For example:
from numpy import linalg as sla
is incorrect. Rather, use this formulation:
import numpy.linalg as sla
For copying a vector into a different matrix, I actually couldn't find a NumPy function; numpy.tile cannot give me a solution, so I had to convert their code line by line and create a function that deals with normal Python arrays. Please let me know if there is any NumPy solution for that.
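For reference, a plain NumPy slice assignment can usually handle this kind of copy without an explicit loop; a small sketch (the matrix M and the indices here are made up for illustration):
import numpy as np

M = np.zeros((5, 4))                 # destination matrix
v = np.array([1.0, 2.0, 3.0, 4.0])

M[2, :] = v                          # copy v into row 2 of M
M[:, 1] = np.arange(5)               # or copy a vector into column 1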
This step of the algorithm involves computing the outer product of two vectors u and v and subtracting that product off the distributed (RDD) matrix S.
This is tough, because multiplying u and v will result in a matrix with the same dimensions as S; thus, we cannot perform typical in-core multiplication of these vectors.
Instead, we can broadcast both vectors over the cluster and perform an element-wise subtraction using a single map.
Broadcast u and v to the workers, e.g. sc.broadcast(u) and sc.broadcast(v).
Then map over the RDD, with each row subtracting off its slice of the outer product u * v.
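A rough sketch of how this could look, assuming S is an RDD of (row index, NumPy row vector) pairs; the real script's RDD layout may differ, and the vectors below are stand-ins:
import numpy as np
from pyspark import SparkContext

sc = SparkContext('local', 'deflation-sketch')
u = np.random.random(4)                                   # one entry per row of S
v = np.random.random(3)                                   # one entry per column of S
S = sc.parallelize([(i, np.random.random(3)) for i in range(4)])

u_b = sc.broadcast(u)
v_b = sc.broadcast(v)

def subtract_outer(pair):
    i, row = pair
    # row i of the outer product u v^T is u[i] * v
    return (i, row - u_b.value[i] * v_b.value)

S = S.map(subtract_outer)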
This Spark primitive is a little trickier than #20. This is due to the fact that the matrix will be row-distributed, but in vector-matrix multiplication, the columns of the matrix are multiplied.
Still, this can be done in a fairly straightforward manner.
Broadcast the vector u to be multiplied, e.g. sc.broadcast(u).
flatMap over the RDD: each row emits a (column index, value) pair for every element, scaled by the corresponding entry of u (hence flatMap instead of map).
reduceByKey will then sum up the values for each key, which correspond to the elements of the resulting vector.
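Again as a sketch, assuming the same (row index, row vector) RDD layout and stand-in data:
import numpy as np
from pyspark import SparkContext

sc = SparkContext('local', 'vecmat-sketch')
u = np.array([1.0, 2.0])                              # one entry per row of S
S = sc.parallelize([(0, np.array([1.0, 2.0, 3.0])),
                    (1, np.array([4.0, 5.0, 6.0]))])
u_b = sc.broadcast(u)

def emit_products(pair):
    i, row = pair
    # each row contributes u[i] * S[i, j] to output element j
    return [(j, u_b.value[i] * x) for j, x in enumerate(row)]

# reduceByKey sums the per-column contributions, i.e. the elements of u^T S
result = S.flatMap(emit_products).reduceByKey(lambda a, b: a + b)
print(sorted(result.collect()))                       # [(0, 9.0), (1, 12.0), (2, 15.0)]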
Review the current Spark implementation and identify points in the code that might be resulting in erroneous results.
The PySpark API has improved considerably in the last several months--there are now several data structures and distributed methods that can be used in native PySpark.
For generating random vectors / matrices:
Distributed data structures and primitives:
However, the thunder-project also has very mature Python-based distributed linear algebra structures and methods built on top of Spark that we can use.
Line 118: rather than invoke an O(n) array copy routine, just update the pointer to u_old:
u_old = u_new
Finish the full pseudocode for the C++ rank-1 decomposition.
There's no reason to keep code that is commented out in the repository. git's versioning system will retain all the development history of the file in case we need to revert the version to a previous one.
As a general rule, you should never commit code that has lines of commented-out code.
Replace explicit loops in favor of vectorized operations (via numpy and scipy).
To set up your environment for Spark, here's what I recommend:
Set up an Ubuntu virtual machine (~20GB hard disk space, ~4GB memory, at least 2 CPU cores) with these settings. You'll likely also need to install git
and Sublime Text again on the virtual machine so you can do your development there. However, this is probably still the easiest way to work with Spark.
I think we need to provide a try/except block in case the user does not supply the input elements correctly. As of today I'm not a professional Python programmer, so I need your help with writing that exception handling if needed.
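A minimal sketch of the kind of guard being described (the file name and messages below are placeholders, not the project's actual interface):
import sys
import numpy as np

try:
    S = np.loadtxt('file_s.txt')         # placeholder input path
except (IOError, ValueError) as e:
    # IOError: the file cannot be read; ValueError: it is not well-formed numeric data
    print('Could not read the input matrix:', e)
    sys.exit(1)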
@magsol
In starting with Spark, I'm sorry if my questions are too simple; since I'm a complete beginner in Spark, I need your support too.
We can start by importing a text file and making an RDD with a command like this:
S = sc.textFile('../../file_s.txt')
Am I right?
Is it necessary to use sc.parallelize() at the beginning?
The very first step of the algorithm, before the loops even begin, is to whiten the columns of the input matrix S. This means subtracting off the mean and rescaling the columns to have unit norms.
Luckily, thunder-project has the perfect function: http://thunder-project.org/thunder/docs/generated/thunder.RowMatrix.html#thunder.RowMatrix.zscore . Make sure we specify axis = 1 (the column axis) and this will perform the whitening.
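In plain NumPy, the same per-column whitening is only a couple of vectorized operations; a sketch (note that NumPy's axis=0 walks down the columns, and whether to divide by the column standard deviation, as zscore does, or by the column norm is a detail to pin down):
import numpy as np

S = np.random.random((100, 10))        # stand-in for the input matrix
S = S - S.mean(axis=0)                 # subtract off each column's mean
S = S / np.linalg.norm(S, axis=0)      # rescale each column to unit norm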
In preparation to start Milestone 3, please rename dictLearningSpark.py to r1dl.py (to give parity of the script with the original C++ version), and set up r1dl_spark.py for the Milestone 3 work.
Dear Xiang:
Thanks for your previous comment. Now I see that in the function op_selectTopR we're trying to create a new vector in which the values of the elements are greater than N. Am I right?
If yes, would you please let me know what the role of the following instruction is?
std::nth_element(tmp.begin(), tmp.begin()+R, tmp.end(), std::greater());
There are many small syntactical fixes that need to be made for the code to be production-ready and open sourced. They include (but are not limited to):
Spaces around assignment operators (x = 5, rather than x=5).
Spaces after commas (var1, var2 instead of var1,var2).
import numpy as np
x = np.array([1, 2, 3,
              4, 5, 6, 7, 8, 9, 10],
             dtype = np.float)  # Any continuation of one line should be indented
def square(x):
    """
    Returns the square (x^2) of the input argument.

    Parameters
    ----------
    x : float
        The base value to be squared.

    Returns
    -------
    x * x : float
        The square of the input argument.
    """
    return x * x
Please provide a docstring for op_getResidual
using the same format as the other functions.
Need some test data to use as testing input for the prototype.
As far as I know, there are three ways to copy one vector into another in Python:
1. v1 = v2
2. v1[:] = v2 (it seems this is faster than 1)
3. numpy.copyto(v2, v1, casting='same_kind', where=None)
I think the 3rd one would be the best, but unfortunately it doesn't work: after testing, the value of v2 is still the initial value and not the copied one (Python 3.5).
Should I use the 2nd method?
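For reference, a small sketch of how the three options behave (note that option 1 does not copy anything, it only rebinds the name v1):
import numpy as np

v2 = np.array([1.0, 2.0, 3.0])

v1 = v2               # no copy: v1 and v2 now refer to the same array
v1 = np.zeros(3)
v1[:] = v2            # element-wise copy into the existing array v1
v1 = np.zeros(3)
np.copyto(v1, v2)     # also copies: destination first, source second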
The descriptions of the parameters to op_selectTopR
and op_getResidual
are not very descriptive, e.g. vct_input: indicating input vector
. Please improve these descriptions to provide a new user with intuition for exactly what the parameter is, how it is used, and what its larger role is in the overall program.
Finish filling out the command-line arguments in terms of the argparse
package in Python.
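A minimal sketch of the argparse wiring (the argument names below are hypothetical placeholders, not the script's final interface):
import argparse

parser = argparse.ArgumentParser(description='Rank-1 dictionary learning (sketch).')
# hypothetical arguments, for illustration only
parser.add_argument('-i', '--input', required=True,
                    help='Path to the input matrix S (text file).')
parser.add_argument('-l', '--length', type=int, default=None,
                    help='Vector length T; could also be inferred from the data.')
args = parser.parse_args()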
Please post the errors you're getting that are preventing you from pushing to github.
(this relies on #4)
Implement a segment that will successfully read in fMRI data off disk.
For the output files I already used np.savetxt with Xiang's format '%.50lf\t'; I think this is the best option :)
np.savetxt(file_D, D, fmt='%.50lf\t')
np.savetxt(file_Z, Z, fmt='%.50lf\t')
Test scalability of Spark implementation on a cluster (LJ cluster at UGA; AWS EC2 clusters using the thunder-ec2 scripts). Identify bottlenecks in the code.
Use either the nose
or unittest
packages to conduct focused unit testing.
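As a tiny sketch of what such a test might look like (normalize_l2 below is a hypothetical stand-in for one of the project's functions):
import unittest
import numpy as np

def normalize_l2(v):
    # hypothetical helper: rescale v to unit l2 norm
    return v / np.linalg.norm(v)

class TestNormalize(unittest.TestCase):
    def test_unit_norm(self):
        v = np.array([0.0, 1.0, 2.0, 5.0, 0.0])
        np.testing.assert_almost_equal(np.linalg.norm(normalize_l2(v)), 1.0)

if __name__ == '__main__':
    unittest.main()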
The NumPy library has built-in array-array multiplication operations; let's use them.
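For instance, a quick sketch of the relevant built-ins (the arrays here are stand-ins):
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0])
S = np.random.random((3, 2))

w = np.dot(u, S)       # vector-matrix product, shape (2,)
A = np.outer(u, v)     # outer product u v^T, shape (3, 2)
e = u * u              # element-wise product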
There are some import statements that are never used: sys
, StringIO
, math
, and random
in particular. Possibly others.
Dear Dr. Quinn:
For the normalization functions, it seems that they are mostly heuristic and designed from experience to fit this problem. Thus it's not possible to find exact equivalents for these functions in numpy or scipy, so I think I should convert Xiang's normalization functions line by line. For example, I wrote the following for "stat_normalize2l2NormVCT":
import numpy as np

vct_input = np.array([0, 1, 2, 5, 0], dtype=float)
T = 5
double_l2norm = 0
for t in range(T):
    double_l2norm = vct_input[t] * vct_input[t] + double_l2norm
    print(vct_input[t])
double_l2norm = np.sqrt(double_l2norm)
for t in range(T):
    vct_input[t] = vct_input[t] / double_l2norm
print(vct_input)
===================={ output }==========
0.0
1.0
2.0
5.0
0.0
[ 0. 0.18257419 0.36514837 0.91287093 0. ]
[Finished in 0.3s]
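For what it's worth, the same l2 normalization can also be written without the explicit loops; a vectorized sketch that reproduces the output above:
import numpy as np

vct_input = np.array([0, 1, 2, 5, 0], dtype=float)
vct_input = vct_input / np.linalg.norm(vct_input)
# [ 0.          0.18257419  0.36514837  0.91287093  0.        ]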
Each indented line is using a tab \t
character; these should be replaced with 4 space characters per indentation level. Sublime should tell you if the indentation is using a tab character (looks like a straight horizontal line when highlighted) or spaces (a series of dots); make sure it's the latter.
At first glance, I suspect length
and pnumber
can be inferred from the data read into the program.
There are a lot of Python-based program profilers we can use to benchmark the performance of the Python port. Among them:
We should definitely make use of one or more of these.
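As one standard-library option (an illustration only, not necessarily one of the profilers referred to above), cProfile is easy to try:
import cProfile

# Profile a statement and sort the report by cumulative time. A whole script
# can also be profiled from the shell with
#   python -m cProfile -s cumtime r1dl.py    (script name is a placeholder)
cProfile.run('sum(x * x for x in range(100000))', sort='cumtime')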
Can we use the following code to infer P and T?
import argparse
from pyspark import SparkContext, SparkConf
from pyspark.mllib.linalg.distributed import RowMatrix
...
S = sc.textFile("file_s")
y = RowMatrix(S)
T = y.numRows()
P = y.numCols()
For a given input, compare the Z.txt file produced by each implementation. This will help identify discrepancies between the two implementations.
We would need the code for parallel matrix-vector multiplication and the corresponding test cases to see whether we could get a performance boost from doing so, which would support the development of the parallel version of r1DL and sccDL.
Xiang,
I'm having a difficult time understanding what the nature of the input data to the Spark implementation should be. Until now we've been testing with smaller datasets that are "tall and skinny", i.e. a large number of rows but a small number of columns. When I wrote the pseudocode for the method on this repo's wiki, it was my assumption that T >>> P
, where T
is the number of rows in S
and P
is the number of columns.
However, in the larger datasets (including the MOTOR dataset), the number of columns is significantly larger, hence my thought that the data needed to be transposed. But it seems like all the data in MOTOR is that way: "short-and-wide", or the number of rows is very small relative to the number of columns.
This presents some problems with the current implementation, since we distributed the data by rows. Since the data are dense, that means many fewer nodes, each with very large, very dense vectors. We'll need to rethink the implementation to take advantage of the short-and-wide input structure IF this is the case.
So we need some clarifications on the nature of the input data.
@LindberghLi
Dear Xiang, the test results for test 1 are as follows; would you please check them and let me know the results of your consideration?
z = [[-0.27229654 0.19700459 -0.19796643 -0.30472146 0.13367598]
[-0.22809997 -0.26494267 -0.19049078 -0.32886685 0.16040115]
[-0.17420122 -0.23350887 -0.19283139 -0.37827604 -0.20294629]
[ 0.17752543 -0.20941425 -0.15429241 -0.40064487 -0.22920292]
[-0.11979997 -0.22540733 -0.17153496 -0.40088513 -0.2192995 ]]
D = [[-0.10122219 -0.10122219 -0.10122219 -0.10122219 -0.10122219]
[ 0.1053534 0.1053534 0.1053534 0.1053534 0.1053534 ]
[ 0.0181571 0.0181571 0.0181571 0.0181571 0.0181571 ]
[ 0.14481954 0.14481954 0.14481954 0.14481954 0.14481954]
[-0.12595233 -0.12595233 -0.12595233 -0.12595233 -0.12595233]]
The PySpark API provides a couple of sorting primitives:
The last one in particular looks promising for our uses (here's a StackOverflow question on its use).
Results should be identical (within a tolerance threshold).
Quick Python script to parse and compare the two outputs.
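A minimal sketch of such a comparison script (the file names are placeholders):
import numpy as np

d_cpp = np.loadtxt('D_cpp.txt')          # output from the C++ implementation
d_spark = np.loadtxt('D_spark.txt')      # output from the Spark implementation

if np.allclose(d_cpp, d_spark, atol=1e-6):
    print('Outputs match within tolerance.')
else:
    print('Maximum absolute difference:', np.abs(d_cpp - d_spark).max())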
There are many lines in the file that have trailing spaces (extra space characters at the end of a line), as well as spaces between characters that should be deleted (between an opening parenthesis (
and the argument of a function).
For example:
def func( arg1, arg2 ):
has extra spaces between the parentheses and the argument names. The correct version should be:
def func(arg1, arg2):
There should, however, be spaces around operators (such as +, -, *, /, and =) and after commas (,).
For example:
print('Analyzing component ',(m+1),'...')
is incorrect, as there are no spaces between the operator +
and the operands, or any spaces after the commas. The correct version should be:
print('Analyzing component ', (m + 1), '...')
@LindberghLi:
I'm helping @MOJTABAFA with unit testing, and in particular the selection of the top R
elements seems to be critical. In checking that our function works correctly, I've been referencing the C++ implementation (line 115) and I have a question.
Specifically, if nth_element
provides a partial sorting of the array wherein the larger elements are on the left of the nth element, and the smaller elements are to the right, why then is the loop over all N
elements on line 122 needed? Surely you only need to loop over the first R
elements (guaranteed to evaluate to true
in the if
statement on line 124). Or am I missing something?
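For comparison, here is a sketch of the analogous top-R selection on the Python side using a partial sort (this is not the C++ code, just the same idea expressed with NumPy):
import numpy as np

v = np.array([0.3, -2.0, 1.5, 0.1, 4.0, -0.7])
R = 3

# np.argpartition plays the role of std::nth_element: after partitioning on -v,
# the indices of the R largest values sit in the first R positions (unordered)
idx = np.argpartition(-v, R)[:R]

top_r = np.zeros_like(v)
top_r[idx] = v[idx]    # keep the top-R values, zero out everything else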
There are some variables that are defined but never used. Particular examples include RAND_MAX
, file_summary
, and totoalResidual
.
Also, I believe totoalResidual
is misspelled. But even its correctly-spelled version is never used, so the variable should either be used somewhere or deleted entirely.
@LindberghLi
For running and debugging the Python code, I need the following information to make sure about the functionality of the program; after that we can work on benchmarking this program against the C++ version:
Please provide me this info as soon as you can.
Let's use good documentation practices and coding style from the start. In particular, rename code1.py to be something representative of what the script is doing.
Based on today's discussion with Xiang, it seems that our randVCT function could easily be translated into a one-line instruction in Python. Today I found that the reason Xiang used RAND_MAX is only to normalize the random number to between (-1, 1), since RAND_MAX gives us the maximum possible random value. However, in Python we have a random generator which gives us a number between 0 and 1, so we don't need RAND_MAX. So, do you have any idea about how to change "stat_randVCT"?
thanks.
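If that reading is right, a one-line NumPy replacement for stat_randVCT could be as simple as this sketch:
import numpy as np

T = 100                                   # stand-in vector length
u = np.random.uniform(-1.0, 1.0, T)       # uniform in (-1, 1); no RAND_MAX needed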
Install the SublimeLinter plugin for Sublime Text 3, if you have not done so already.
"pep8": {
"@disable": false,
"args": [],
"excludes": [],
"ignore": "E302,E251,E501,E701,E128,W391,E265",
"max-line-length": null,
"select": ""
This goes in the Preferences -> Package Settings -> SublimeLinter -> User Settings
file. This will eliminate some warnings from your IDE.