quinngroup / dr1dl-pyspark
Dictionary Learning in PySpark
License: Apache License 2.0
On line 106, where u_old is defined as a random T-length vector, the pseudocode indicates the vector should be 0-mean and unit-length, but the mean is never subtracted off. Please add that operation.
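A minimal sketch of that initialization (assuming u_old is a NumPy array of length T, as in the prototype; the value of T here is a stand-in):
import numpy as np

T = 100                           # stand-in length of the vector
u_old = np.random.random(T)       # random T-length vector
u_old -= u_old.mean()             # subtract off the mean (0-mean)
u_old /= np.linalg.norm(u_old)    # rescale to unit length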
Import statements in Python files need to follow three guidelines:
For example:
import numpy as np
import numpy.testing
import numpy.linalg as sla
import argparse
is incorrect. The correct ordering would be:
import argparse
import numpy as np
import numpy.linalg as sla
import numpy.testing
Avoid the from package import function form, as this creates potential namespace collisions. Rather, use the import package.subpackage as alias syntax. For example:
from numpy import linalg as sla
is incorrect. Rather, use this formulation:
import numpy.linalg as sla
For copying a vector into a different matrix, I actually couldn't find a NumPy function; numpy.tile cannot give me a solution, so I had to convert their code line by line and create a function that deals with normal Python arrays. Please let me know if there is any NumPy solution for that.
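For reference, a plain NumPy slice assignment can usually handle this kind of copy without an explicit loop; a small sketch (the matrix M and the indices here are made up for illustration):
import numpy as np

M = np.zeros((5, 4))                 # destination matrix
v = np.array([1.0, 2.0, 3.0, 4.0])

M[2, :] = v                          # copy v into row 2 of M
M[:, 1] = np.arange(5)               # or copy a vector into column 1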
This step of the algorithm involves computing the outer product of two vectors u and v and subtracting that product off the distributed (RDD) matrix S.
This is tough, because multiplying u and v will result in a matrix with the same dimensions as S; thus, we cannot perform typical in-core multiplication of these vectors.
Instead, we can broadcast both vectors over the cluster and perform an element-wise subtraction using a single map.
Broadcast u and v to the workers, e.g. sc.broadcast(u) and sc.broadcast(v).
Then map over the RDD, with each row subtracting off its slice of the outer product u * v.
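A rough sketch of how this could look, assuming S is an RDD of (row index, NumPy row vector) pairs; the real script's RDD layout may differ, and the vectors below are stand-ins:
import numpy as np
from pyspark import SparkContext

sc = SparkContext('local', 'deflation-sketch')
u = np.random.random(4)                                   # one entry per row of S
v = np.random.random(3)                                   # one entry per column of S
S = sc.parallelize([(i, np.random.random(3)) for i in range(4)])

u_b = sc.broadcast(u)
v_b = sc.broadcast(v)

def subtract_outer(pair):
    i, row = pair
    # row i of the outer product u v^T is u[i] * v
    return (i, row - u_b.value[i] * v_b.value)

S = S.map(subtract_outer)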
This Spark primitive is a little trickier than #20. This is due to the fact that the matrix will be row-distributed, but in vector-matrix multiplication, the columns of the matrix are multiplied.
Still, this can be done in a fairly straightforward manner.
Broadcast the vector u to be multiplied, e.g. sc.broadcast(u).
flatMap over the RDD: each row emits a (column index, value) pair for every element, scaled by the corresponding entry of u (hence flatMap instead of map).
reduceByKey will then sum up the values for each key, which correspond to the elements of the resulting vector.
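Again as a sketch, assuming the same (row index, row vector) RDD layout and stand-in data:
import numpy as np
from pyspark import SparkContext

sc = SparkContext('local', 'vecmat-sketch')
u = np.array([1.0, 2.0])                              # one entry per row of S
S = sc.parallelize([(0, np.array([1.0, 2.0, 3.0])),
                    (1, np.array([4.0, 5.0, 6.0]))])
u_b = sc.broadcast(u)

def emit_products(pair):
    i, row = pair
    # each row contributes u[i] * S[i, j] to output element j
    return [(j, u_b.value[i] * x) for j, x in enumerate(row)]

# reduceByKey sums the per-column contributions, i.e. the elements of u^T S
result = S.flatMap(emit_products).reduceByKey(lambda a, b: a + b)
print(sorted(result.collect()))                       # [(0, 9.0), (1, 12.0), (2, 15.0)]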
Review the current Spark implementation and identify points in the code that might be resulting in erroneous results.
The PySpark API has improved considerably in the last several months--there are now several data structures and distributed methods that can be used in native PySpark.
For generating random vectors / matrices:
Distributed data structures and primitives:
However, the thunder-project also has very mature Python-based distributed linear algebra structures and methods built on top of Spark that we can use.
Line 118: rather than invoke an O(n) array copy routine, just update the pointer to u_old:
u_old = u_new
Finish the full pseudocode for the C++ rank-1 decomposition.
There's no reason to keep code that is commented out in the repository. git's versioning system will retain all the development history of the file in case we need to revert the version to a previous one.
As a general rule, you should never commit code that has lines of commented-out code.
Replace explicit loops in favor of vectorized operations (via numpy and scipy).
To set up your environment for Spark, here's what I recommend:
Set up an Ubuntu virtual machine (~20GB hard disk space, ~4GB memory, at least 2 CPU cores) with these settings. You'll likely also need to install git
and Sublime Text again on the virtual machine so you can do your development there. However, this is probably still the easiest way to work with Spark.
I think we need to provide a try/except block in case the user does not supply the input elements correctly. As of today I'm not a professional Python programmer, so I need your help with writing that exception handling if needed.
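A minimal sketch of the kind of guard being described (the file name and messages below are placeholders, not the project's actual interface):
import sys
import numpy as np

try:
    S = np.loadtxt('file_s.txt')         # placeholder input path
except (IOError, ValueError) as e:
    # IOError: the file cannot be read; ValueError: it is not well-formed numeric data
    print('Could not read the input matrix:', e)
    sys.exit(1)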
@magsol
In starting with Spark, I'm sorry if my questions are too simple; since I'm a complete beginner in Spark, I need your support too.
We can start by importing a text file and making an RDD with a command like this:
S = sc.textFile('../../file_s.txt')
Am I right?
Is it necessary to use sc.parallelize() at the beginning?
The very first step of the algorithm, before the loops even begin, is to whiten the columns of the input matrix S. This means subtracting off the mean and rescaling the columns to have unit norms.
Luckily, thunder-project has the perfect function: http://thunder-project.org/thunder/docs/generated/thunder.RowMatrix.html#thunder.RowMatrix.zscore . Make sure we specify axis = 1 (the column axis) and this will perform the whitening.
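In plain NumPy, the same per-column whitening is only a couple of vectorized operations; a sketch (note that NumPy's axis=0 walks down the columns, and whether to divide by the column standard deviation, as zscore does, or by the column norm is a detail to pin down):
import numpy as np

S = np.random.random((100, 10))        # stand-in for the input matrix
S = S - S.mean(axis=0)                 # subtract off each column's mean
S = S / np.linalg.norm(S, axis=0)      # rescale each column to unit norm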
In preparation to start Milestone 3, please rename dictLearningSpark.py to r1dl.py (to give parity of the script with the original C++ version), and set up r1dl_spark.py for the Milestone 3 work.
Dear Xiang:
Thanks for your previous comment. Now I see that in the function op_selectTopR we're trying to create a new vector in which the values of the elements are greater than N. Am I right?
If yes, would you please let me know what the role of the following instruction is?
std::nth_element(tmp.begin(), tmp.begin()+R, tmp.end(), std::greater());
There are many small syntactical fixes that need to be made for the code to be production-ready and open sourced. They include (but are not limited to):
Spaces around assignment operators (x = 5, rather than x=5).
Spaces after commas (var1, var2 instead of var1,var2).
import numpy as np
x = np.array([1, 2, 3,
              4, 5, 6, 7, 8, 9, 10],
             dtype = np.float)  # Any continuation of one line should be indented
def square(x):
    """
    Returns the square (x^2) of the input argument.

    Parameters
    ----------
    x : float
        The base value to be squared.

    Returns
    -------
    x * x : float
        The square of the input argument.
    """
    return x * x
Please provide a docstring for op_getResidual
using the same format as the other functions.
Need some test data to use as testing input for the prototype.
As far as I know, there are three ways to copy one vector into another in Python:
1. v1 = v2
2. v1[:] = v2 (it seems this is faster than 1)
3. numpy.copyto(v2, v1, casting='same_kind', where=None)
I think the 3rd one would be the best, but unfortunately it doesn't work: after testing, the value of v2 is still the initial value and not the copied one (Python 3.5).
Should I use the 2nd method?
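For reference, a small sketch of how the three options behave (note that option 1 does not copy anything, it only rebinds the name v1):
import numpy as np

v2 = np.array([1.0, 2.0, 3.0])

v1 = v2               # no copy: v1 and v2 now refer to the same array
v1 = np.zeros(3)
v1[:] = v2            # element-wise copy into the existing array v1
v1 = np.zeros(3)
np.copyto(v1, v2)     # also copies: destination first, source second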
The descriptions of the parameters to op_selectTopR
and op_getResidual
are not very descriptive, e.g. vct_input: indicating input vector
. Please improve these descriptions to provide a new user with intuition for exactly what the parameter is, how it is used, and what its larger role is in the overall program.
Finish filling out the command-line arguments in terms of the argparse
package in Python.
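A minimal sketch of the argparse wiring (the argument names below are hypothetical placeholders, not the script's final interface):
import argparse

parser = argparse.ArgumentParser(description='Rank-1 dictionary learning (sketch).')
# hypothetical arguments, for illustration only
parser.add_argument('-i', '--input', required=True,
                    help='Path to the input matrix S (text file).')
parser.add_argument('-l', '--length', type=int, default=None,
                    help='Vector length T; could also be inferred from the data.')
args = parser.parse_args()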
Please post the errors you're getting that are preventing you from pushing to github.
(this relies on #4)
Implement a segment that will successfully read in fMRI data off disk.
For the output files I already used np.savetxt with Xiang's format '%.50lf\t'; I think this is the best option :)
np.savetxt(file_D, D, fmt='%.50lf\t')
np.savetxt(file_Z, Z, fmt='%.50lf\t')
Test scalability of Spark implementation on a cluster (LJ cluster at UGA; AWS EC2 clusters using the thunder-ec2 scripts). Identify bottlenecks in the code.
Use either the nose
or unittest
packages to conduct focused unit testing.
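As a tiny sketch of what such a test might look like (normalize_l2 below is a hypothetical stand-in for one of the project's functions):
import unittest
import numpy as np

def normalize_l2(v):
    # hypothetical helper: rescale v to unit l2 norm
    return v / np.linalg.norm(v)

class TestNormalize(unittest.TestCase):
    def test_unit_norm(self):
        v = np.array([0.0, 1.0, 2.0, 5.0, 0.0])
        np.testing.assert_almost_equal(np.linalg.norm(normalize_l2(v)), 1.0)

if __name__ == '__main__':
    unittest.main()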
The NumPy library has built-in array-array multiplication operations; let's use them.
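For instance, a quick sketch of the relevant built-ins (the arrays here are stand-ins):
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0])
S = np.random.random((3, 2))

w = np.dot(u, S)       # vector-matrix product, shape (2,)
A = np.outer(u, v)     # outer product u v^T, shape (3, 2)
e = u * u              # element-wise product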
There are some import statements that are never used: sys
, StringIO
, math
, and random
in particular. Possibly others.
Dear Dr. Quinn:
For the normalization functions, it seems that they are mostly heuristic and designed from experience to fit this problem. Thus it's not possible to find exact equivalents for these functions in numpy or scipy, so I think I should convert Xiang's normalization functions line by line. For example, I wrote the following for "stat_normalize2l2NormVCT":
import numpy as np

vct_input = np.array([0, 1, 2, 5, 0], dtype=float)
T = 5
double_l2norm = 0
for t in range(T):
    double_l2norm = vct_input[t] * vct_input[t] + double_l2norm
    print(vct_input[t])
double_l2norm = np.sqrt(double_l2norm)
for t in range(T):
    vct_input[t] = vct_input[t] / double_l2norm
print(vct_input)
===================={ output }==========
0.0
1.0
2.0
5.0
0.0
[ 0. 0.18257419 0.36514837 0.91287093 0. ]
[Finished in 0.3s]
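For what it's worth, the same l2 normalization can also be written without the explicit loops; a vectorized sketch that reproduces the output above:
import numpy as np

vct_input = np.array([0, 1, 2, 5, 0], dtype=float)
vct_input = vct_input / np.linalg.norm(vct_input)
# [ 0.          0.18257419  0.36514837  0.91287093  0.        ]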
Each indented line is using a tab \t
character; these should be replaced with 4 space characters per indentation level. Sublime should tell you if the indentation is using a tab character (looks like a straight horizontal line when highlighted) or spaces (a series of dots); make sure it's the latter.
At first glance, I suspect length
and pnumber
can be inferred from the data read into the program.
There are a lot of Python-based program profilers we can use to benchmark the performance of the Python port. Among them:
We should definitely make use of one or more of these.
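As one standard-library option (an illustration only, not necessarily one of the profilers referred to above), cProfile is easy to try:
import cProfile

# Profile a statement and sort the report by cumulative time. A whole script
# can also be profiled from the shell with
#   python -m cProfile -s cumtime r1dl.py    (script name is a placeholder)
cProfile.run('sum(x * x for x in range(100000))', sort='cumtime')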
Can we use the following code to infer P and T?
import argparse
from pyspark import SparkContext, SparkConf
from pyspark.mllib.linalg.distributed import RowMatrix
...
S = sc.textFile("file_s")
y = RowMatrix(S)
T = y.numRows()
P = y.numCols()
For a given input, compare the Z.txt file produced by each implementation. This will help identify discrepancies between the two implementations.
We would need the code for parallel matrix-vector multiplication and the corresponding test cases to see whether we could get a performance boost from doing so, which would support the development of the parallel version of r1DL and sccDL.
Xiang,
I'm having a difficult time understanding what the nature of the input data to the Spark implementation should be. Until now we've been testing with smaller datasets that are "tall and skinny", i.e. a large number of rows but a small number of columns. When I wrote the pseudocode for the method on this repo's wiki, it was my assumption that T >>> P
, where T
is the number of rows in S
and P
is the number of columns.
However, in the larger datasets (including the MOTOR dataset), the number of columns is significantly larger, hence my thought that the data needed to be transposed. But it seems like all the data in MOTOR is that way: "short-and-wide", or the number of rows is very small relative to the number of columns.
This presents some problems with the current implementation, since we distributed the data by rows. Since the data are dense, that means many fewer nodes, each with very large, very dense vectors. We'll need to rethink the implementation to take advantage of the short-and-wide input structure IF this is the case.
So we need some clarifications on the nature of the input data.
@LindberghLi
Dear Xiang, the test results for test 1 are as follows; would you please check them and let me know the results of your consideration?
z = [[-0.27229654 0.19700459 -0.19796643 -0.30472146 0.13367598]
[-0.22809997 -0.26494267 -0.19049078 -0.32886685 0.16040115]
[-0.17420122 -0.23350887 -0.19283139 -0.37827604 -0.20294629]
[ 0.17752543 -0.20941425 -0.15429241 -0.40064487 -0.22920292]
[-0.11979997 -0.22540733 -0.17153496 -0.40088513 -0.2192995 ]]
D = [[-0.10122219 -0.10122219 -0.10122219 -0.10122219 -0.10122219]
[ 0.1053534 0.1053534 0.1053534 0.1053534 0.1053534 ]
[ 0.0181571 0.0181571 0.0181571 0.0181571 0.0181571 ]
[ 0.14481954 0.14481954 0.14481954 0.14481954 0.14481954]
[-0.12595233 -0.12595233 -0.12595233 -0.12595233 -0.12595233]]
The PySpark API provides a couple of sorting primitives:
The last one in particular looks promising for our uses (here's a StackOverflow question on its use).
Results should be identical (within a tolerance threshold).
Quick Python script to parse and compare the two outputs.
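A minimal sketch of such a comparison script (the file names are placeholders):
import numpy as np

d_cpp = np.loadtxt('D_cpp.txt')          # output from the C++ implementation
d_spark = np.loadtxt('D_spark.txt')      # output from the Spark implementation

if np.allclose(d_cpp, d_spark, atol=1e-6):
    print('Outputs match within tolerance.')
else:
    print('Maximum absolute difference:', np.abs(d_cpp - d_spark).max())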
There are many lines in the file that have trailing spaces (extra space characters at the end of a line), as well as spaces between characters that should be deleted (between an opening parenthesis (
and the argument of a function).
For example:
def func( arg1, arg2 ):
has extra spaces between the parentheses and the argument names. The correct version should be:
def func(arg1, arg2):
There should, however, be spaces around operators (such as +, -, *, /, and =) and after commas (,).
For example:
print('Analyzing component ',(m+1),'...')
is incorrect, as there are no spaces between the operator +
and the operands, or any spaces after the commas. The correct version should be:
print('Analyzing component ', (m + 1), '...')
@LindberghLi:
I'm helping @MOJTABAFA with unit testing, and in particular the selection of the top R
elements seems to be critical. In checking that our function works correctly, I've been referencing the C++ implementation (line 115) and I have a question.
Specifically, if nth_element
provides a partial sorting of the array wherein the larger elements are on the left of the nth element, and the smaller elements are to the right, why then is the loop over all N
elements on line 122 needed? Surely you only need to loop over the first R
elements (guaranteed to evaluate to true
in the if
statement on line 124). Or am I missing something?
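For comparison, here is a sketch of the analogous top-R selection on the Python side using a partial sort (this is not the C++ code, just the same idea expressed with NumPy):
import numpy as np

v = np.array([0.3, -2.0, 1.5, 0.1, 4.0, -0.7])
R = 3

# np.argpartition plays the role of std::nth_element: after partitioning on -v,
# the indices of the R largest values sit in the first R positions (unordered)
idx = np.argpartition(-v, R)[:R]

top_r = np.zeros_like(v)
top_r[idx] = v[idx]    # keep the top-R values, zero out everything else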
There are some variables that are defined but never used. Particular examples include RAND_MAX
, file_summary
, and totoalResidual
.
Also, I believe totoalResidual
is misspelled. But even its correctly-spelled version is never used, so the variable should either be used somewhere or deleted entirely.
@LindberghLi
For running and debugging the Python code, I need the following information to make sure about the functionality of the program; after that we can work on benchmarking this program against the C++ version:
Please provide me this info as soon as you can.
Let's use good documentation practices and coding style from the start. In particular, rename code1.py to be something representative of what the script is doing.
Based on today's discussion with Xiang, it seems that our randVCT function could easily be translated into a one-line instruction in Python. Today I found that the reason Xiang used RAND_MAX is only to normalize the random number to between (-1, 1), since RAND_MAX gives us the maximum possible random value. However, in Python we have a random generator which gives us a number between 0 and 1, so we don't need RAND_MAX. So, do you have any idea about how to change "stat_randVCT"?
thanks.
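If that reading is right, a one-line NumPy replacement for stat_randVCT could be as simple as this sketch:
import numpy as np

T = 100                                   # stand-in vector length
u = np.random.uniform(-1.0, 1.0, T)       # uniform in (-1, 1); no RAND_MAX needed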
Install the SublimeLinter plugin for Sublime Text 3, if you have not done so already.
"pep8": {
"@disable": false,
"args": [],
"excludes": [],
"ignore": "E302,E251,E501,E701,E128,W391,E265",
"max-line-length": null,
"select": ""
This goes in the Preferences -> Package Settings -> SublimeLinter -> User Settings
file. This will eliminate some warnings from your IDE.