
dr1dl-pyspark's People

Contributors

arunbalachandran, geofbot, magsol, milad181, mojtabafa


dr1dl-pyspark's Issues

u_old is not 0-mean

On line 106, where u_old is defined as a random T-length vector, the pseudocode indicates the vector should be 0-mean and unit-length, but the mean is never subtracted off. Please add that operation.
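A minimal sketch of the fix (a sketch only, assuming NumPy and the variable names from the pseudocode):

import numpy as np

# Draw a random T-length vector, subtract its mean, then rescale to
# unit length, per the pseudocode.
T = 100
u_old = np.random.random(T)
u_old = u_old - u_old.mean()           # make it 0-mean
u_old = u_old / np.linalg.norm(u_old)  # make it unit-length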

Fix imports

Import statements in Python files need to follow three guidelines:

  • Imports should be listed in alphabetical order by root package name, then all subpackages.

For example:

import numpy as np
import numpy.testing
import numpy.linalg as sla
import argparse

is incorrect. The correct ordering would be:

import argparse
import numpy as np
import numpy.linalg as sla
import numpy.testing
  • No comments or commented-out import statements.
  • Do not use from package import function, as this creates potential namespace collisions. Rather, use the import package.subpackage as alias syntax.

For example:

from numpy import linalg as sla

is incorrect. Rather, use this formulation:

import numpy.linalg as sla

copying vector to matrix

For copying a vector into a different matrix, I actually couldn't find a numpy function; numpy.tile didn't give me a solution, so I had to convert their code line by line, creating a function that deals with normal Python arrays. Please let me know if there is a numpy solution for this.
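For what it's worth, a couple of numpy constructs may already cover this, depending on what "copying into a matrix" means here; a small sketch (shapes and names are illustrative):

import numpy as np

v = np.array([1.0, 2.0, 3.0])
M = np.zeros((4, 3))

M[1, :] = v              # copy v into a single row of an existing matrix
M[:] = v                 # broadcast-copy v into every row of M
M2 = np.tile(v, (4, 1))  # build a new 4x3 matrix whose rows are all v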

P4: Deflation of matrix

This step of the algorithm involves computing the outer product of two vectors u and v and subtracting that product off the distributed (RDD) matrix S.

This is tough, because multiplying u and v will result in a matrix with the same dimensions as S; thus, we cannot perform typical in-core multiplication of these vectors.

Instead, we can broadcast both vectors over the cluster and perform an element-wise subtraction using a single map; a sketch follows the steps below.

  1. Broadcast u and v to the workers, e.g. sc.broadcast(u) and sc.broadcast(v).
  2. Run a map over the RDD.
  3. In each mapper, generate a new row vector by subtracting the corresponding row of the outer product of u and v from the current row.
  4. Return the row vector to create a new, deflated distributed RowMatrix (RDD).
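A minimal sketch of these steps, assuming a SparkContext sc, that S is an RDD of (row_index, row_vector) pairs of NumPy arrays, and that u and v are local NumPy arrays:

import numpy as np

# Broadcast the two factor vectors to all workers.
u_bc = sc.broadcast(u)
v_bc = sc.broadcast(v)

def deflate(pair):
    i, row = pair
    # Row i of the outer product u * v^T is u[i] * v, so the deflated
    # row is the current row minus that contribution.
    return (i, row - u_bc.value[i] * v_bc.value)

S_deflated = S.map(deflate)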

P2: Vector-matrix multiplication

This Spark primitive is a little trickier than #20, because the matrix is row-distributed, while vector-matrix multiplication combines the matrix column-wise.

Still, this can be done in a fairly straightforward manner; a sketch follows the steps below.

  1. As in P1, broadcast the array u to be multiplied, e.g. sc.broadcast(u).
  2. Run a flatMap over the RDD.
  3. Each flatMap worker multiplies its row of the matrix by the corresponding element of the broadcast vector u.
  4. Each value of the resulting vector is emitted, keyed by its element index (hence the need for flatMap instead of map).
  5. A reduceByKey then sums up the values for each key, which correspond to the elements of the resulting vector.
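A minimal sketch of these steps, under the same assumptions as above (a SparkContext sc, S as an RDD of (row_index, row_vector) pairs of NumPy arrays, and u a local NumPy array):

u_bc = sc.broadcast(u)

def scale_row(pair):
    i, row = pair
    scaled = u_bc.value[i] * row
    # Emit each element keyed by its column index.
    return [(j, float(scaled[j])) for j in range(len(scaled))]

# result[j] = sum_i u[i] * S[i, j]; an RDD of (column_index, value) pairs.
result = S.flatMap(scale_row).reduceByKey(lambda a, b: a + b)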

LinAlg operations in PySpark

The PySpark API has improved considerably in the last several months; there are now several data structures and distributed methods that can be used in native PySpark.

For generating random vectors / matrices: pyspark.mllib.random.RandomRDDs.

Distributed data structures and primitives: pyspark.mllib.linalg.distributed (e.g. RowMatrix).

However, the thunder-project also has very mature Python-based distributed linear algebra structures and methods built on top of Spark that we can use.

Delete commented-out code

There's no reason to keep commented-out code in the repository. git's versioning retains the full development history of the file in case we need to revert to a previous version.

As a general rule, you should never commit code that has lines of commented-out code.

Spark development environment setup and install

To set up your environment for Spark, here's what I recommend:

Set up an Ubuntu virtual machine (~20GB hard disk space, ~4GB memory, at least 2 CPU cores) with these settings. You'll likely also need to install git and Sublime Text again on the virtual machine so you can do your development there. However, this is probably still the easiest way to work with Spark.

Exception and Try for importing the argument

I think we need to provide a try/except in case the input arguments are not read correctly from the user. As of today I'm not a professional Python programmer, so I need your help making that exception handling if needed.
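A minimal sketch of what this might look like (the epsilon argument is illustrative, not taken from the actual script): argparse already reports type errors on its own, so the try/except mainly guards value checks.

import argparse
import sys

parser = argparse.ArgumentParser()
parser.add_argument('-e', '--epsilon', type=float, required=True)

try:
    args = parser.parse_args()
    if args.epsilon <= 0:
        raise ValueError('epsilon must be positive')
except ValueError as err:
    sys.exit('Invalid argument: {}'.format(err))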

starting with spark

@magsol
In starting with Spark, I'm sorry if my questions are too simple; since I'm a complete beginner in Spark, I need your support too.
We can start by importing a text file and making an RDD with a command as follows:

S = sc.textFile('../../file_s.txt')

Am I right?
Is it needed to use sc.parallelize() at the beginning?
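For reference, a small sketch of the distinction (file path is illustrative): sc.textFile already produces an RDD straight from a file, while sc.parallelize distributes an in-memory collection, so only one of the two is needed.

# sc.textFile reads a file into an RDD of strings, one element per line.
S = sc.textFile('../../file_s.txt')

# sc.parallelize distributes a local, in-memory collection instead.
rdd = sc.parallelize([1, 2, 3, 4])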

Change script names

In preparation to start Milestone 3, please:

  • change the name of the current script from dictLearningSpark.py to r1dl.py (to give the script parity with the original C++ version).
  • create a new script called r1dl_spark.py for the Milestone 3 work.

the STD::

Dear Xiang:

Thanks for your previous comment. Now I understand that in the function op_selectTopR we're trying to create a new vector whose element values are greater than N.
Am I right?

If yes, would you please let me know what the role of the following instruction is?
std::nth_element(tmp.begin(), tmp.begin()+R, tmp.end(), std::greater());

Python coding convention improvements

There are many small syntactical fixes that need to be made for the code to be production-ready and open sourced. They include (but are not limited to):

  • Putting a space between each binary operator and its operands (e.g. x = 5, rather than x=5)
  • A space after every comma (e.g. var1, var2 instead of var1,var2)
  • Line continuations should be indented from their starting line, e.g.
import numpy as np
x = np.array([1, 2, 3,
    4, 5, 6, 7, 8, 9, 10],
    dtype = np.float64) # Any continuation of one line should be indented
  • Indentation level should be 4 spaces (no tabs)
  • Each function should have an accompanying docstring, e.g.
def square(x):
    """
    Returns the square (x^2) of the input argument.

    Parameters
    ---------------
    x : float
        The base value to be squared.

    Returns
    ----------
    x * x : float
        The square of the input argument.
    """
    return x * x

copying a vector to a vector

As I know, there are three ways to copy a vector to another one in Python:

1. v1 = v2
2. v1[:] = v2 (it seems this is faster than 1)
3. numpy.copyto(v2, v1, casting='same_kind', where=None)

I think the 3rd one would be the best, but unfortunately it doesn't work; after testing, the value of v2 is still the initial value and not the copied one (Python 3.5).
Should I use the 2nd method?
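For what it's worth (an assumption worth testing): option 1 only rebinds the name v1 rather than copying any data, and the where=None argument is the likely reason option 3 appeared to do nothing; numpy's default is where=True, which copies every element. A sketch of the two working forms:

import numpy as np

v1 = np.array([1.0, 2.0, 3.0])
v2 = np.zeros(3)

v2[:] = v1         # option 2: in-place element-wise copy
np.copyto(v2, v1)  # option 3 with defaults: destination first, then source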

Improve descriptions of variables in docstring

The descriptions of the parameters to op_selectTopR and op_getResidual are not very descriptive, e.g. vct_input: indicating input vector. Please improve these descriptions to provide a new user with intuition for exactly what the parameter is, how it is used, and what its larger role is in the overall program.

push throwing errors

Please post the errors you're getting that are preventing you from pushing to github.

output files

For the output files I already used np.savetxt with Xiang's format, '%.50lf\t'; I think this is the best option :)
np.savetxt(file_D, D, fmt='%.50lf\t')
np.savetxt(file_Z, Z, fmt='%.50lf\t')

Performance analysis

Test scalability of Spark implementation on a cluster (LJ cluster at UGA; AWS EC2 clusters using the thunder-ec2 scripts). Identify bottlenecks in the code.

Remove unused imports

There are some import statements that are never used: sys, StringIO, math, and random in particular. Possibly others.

Normalization functions

Dear Dr. Quinn:
For the normalization functions, it seems the functions are mostly heuristic, designed from experience to fit this problem, so it's not possible to find an exact equivalent of these functions in numpy or scipy. Therefore I think I should convert Xiang's normalization functions line by line. For example, I wrote the following for "stat_normalize2l2NormVCT":

import numpy as np

vct_input = np.array([0, 1, 2, 5, 0], dtype=float)
T = 5
double_l2norm = 0
for t in range(T):
    double_l2norm = vct_input[t] * vct_input[t] + double_l2norm
    print(vct_input[t])
double_l2norm = np.sqrt(double_l2norm)

for t in range(T):
    vct_input[t] = vct_input[t] / double_l2norm
print(vct_input)
==================== output ====================
0.0
1.0
2.0
5.0
0.0
[ 0. 0.18257419 0.36514837 0.91287093 0. ]
[Finished in 0.3s]
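That said, for the l2 normalization specifically there is a direct one-line equivalent worth comparing against (a sketch, using the same example vector):

import numpy as np

vct_input = np.array([0, 1, 2, 5, 0], dtype=float)
# np.linalg.norm computes the same l2 norm as the explicit loop above.
vct_normalized = vct_input / np.linalg.norm(vct_input)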

Replace all tab indentations with 4 spaces

Each indented line is using a tab \t character; these should be replaced with 4 space characters per indentation level. Sublime should tell you if the indentation is using a tab character (looks like a straight horizontal line when highlighted) or spaces (a series of dots); make sure it's the latter.

importing the file and inferring P and T

Can we use the following code to infer P and T:

import argparse
from pyspark import SparkContext, SparkConf
from pyspark.mllib.linalg.distributed import RowMatrix
.
.
.
.
    S = sc.textFile("file_s")
    y = RowMatrix(S)
    T = y.numRows()
    P = y.numCols()
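One caveat (an assumption worth checking): RowMatrix expects an RDD of numeric vectors, while sc.textFile yields an RDD of strings, so the lines likely need to be parsed first. A sketch, assuming whitespace-delimited values:

S = sc.textFile("file_s")
# Parse each line of text into a list of floats before building the matrix.
rows = S.map(lambda line: [float(x) for x in line.split()])
y = RowMatrix(rows)
T = y.numRows()
P = y.numCols()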

Comparison scripts

For a given input:

  • Run the Python (not Spark) script on the input to compute the output OR have a pre-made Z.txt file
  • Generate output with the Spark version
  • Compute statistics of the two outputs: how different are they? Where are they different?

This will help identify discrepancies between the two implementations.
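A minimal sketch of the comparison step (file names are illustrative):

import numpy as np

# Load the two outputs and summarize where and by how much they differ.
z_python = np.loadtxt('Z_python.txt')
z_spark = np.loadtxt('Z_spark.txt')

diff = np.abs(z_python - z_spark)
print('max abs difference:  ', diff.max())
print('mean abs difference: ', diff.mean())
print('worst entry (row, col):', np.unravel_index(diff.argmax(), diff.shape))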

P1: Code stubs for parallel matrix-vector multiplication

We need code for parallel matrix-vector multiplication, along with corresponding test cases, to see whether we can get a performance boost from it; this would support development of the parallel versions of r1DL and sccDL.
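A minimal stub of the matrix-vector direction (the row-keyed RDD layout and names are assumptions, mirroring P2): each distributed row contributes one element of the result via a dot product with the broadcast vector.

import numpy as np

# Assuming S is an RDD of (row_index, row_vector) pairs and v is a local
# NumPy array with one entry per column of S.
v_bc = sc.broadcast(v)

# (S v)[i] = dot(S[i, :], v); the result is an RDD of (row_index, value).
result = S.map(lambda pair: (pair[0], float(np.dot(pair[1], v_bc.value))))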

Nature of input matrices

Xiang,

I'm having a difficult time understanding what the nature of the input data to the Spark implementation should be. Until now we've been testing with smaller datasets that are "tall and skinny", i.e. a large number of rows but a small number of columns. When I wrote the pseudocode for the method on this repo's wiki, it was my assumption that T >>> P, where T is the number of rows in S and P is the number of columns.

However, in the larger datasets (including the MOTOR dataset), the number of columns is significantly larger, hence my thought that the data needed to be transposed. But it seems like all the data in MOTOR is that way: "short-and-wide", or the number of rows is very small relative to the number of columns.

This presents some problems with the current implementation, since we distributed the data by rows. Since the data are dense, that means many fewer nodes, each with very large, very dense vectors. We'll need to rethink the implementation to take advantage of the short-and-wide input structure IF this is the case.

So we need some clarifications on the nature of the input data.

Test Results

@LindberghLi
Dear Xiang, the test results for test 1 are as follows; would you please check them and let me know what you think?

z = [[-0.27229654 0.19700459 -0.19796643 -0.30472146 0.13367598]
[-0.22809997 -0.26494267 -0.19049078 -0.32886685 0.16040115]
[-0.17420122 -0.23350887 -0.19283139 -0.37827604 -0.20294629]
[ 0.17752543 -0.20941425 -0.15429241 -0.40064487 -0.22920292]
[-0.11979997 -0.22540733 -0.17153496 -0.40088513 -0.2192995 ]]

D = [[-0.10122219 -0.10122219 -0.10122219 -0.10122219 -0.10122219]
[ 0.1053534 0.1053534 0.1053534 0.1053534 0.1053534 ]
[ 0.0181571 0.0181571 0.0181571 0.0181571 0.0181571 ]
[ 0.14481954 0.14481954 0.14481954 0.14481954 0.14481954]
[-0.12595233 -0.12595233 -0.12595233 -0.12595233 -0.12595233]]

Fix use of spaces between characters and at ends of lines

There are many lines in the file that have trailing spaces (extra space characters at the end of a line), as well as spaces between characters that should be deleted (between an opening parenthesis ( and the argument of a function).

For example:

def func( arg1, arg2 ):

has extra spaces between the parentheses and the argument names. The correct version should be:

def func(arg1, arg2):

There should, however, be spaces around binary operators (such as +, -, *, /, and =) and after commas.

For example:

print('Analyzing component ',(m+1),'...')

is incorrect, as there are no spaces between the operator + and the operands, or any spaces after the commas. The correct version should be:

print('Analyzing component ', (m + 1), '...')

Select top R

@LindberghLi:

I'm helping @MOJTABAFA with unit testing, and in particular the selection of the top R elements seems to be critical. In checking that our function works correctly, I've been referencing the C++ implementation (line 115) and I have a question.

Specifically, if nth_element provides a partial sorting of the array wherein the larger elements are on the left of the nth element, and the smaller elements are to the right, why then is the loop over all N elements on line 122 needed? Surely you only need to loop over the first R elements (guaranteed to evaluate to true in the if statement on line 124). Or am I missing something?
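For the unit-testing side, a hypothetical Python analogue of the nth_element selection (the function name is illustrative) using np.argpartition, which likewise does a partial, O(n) partition rather than a full sort:

import numpy as np

def select_top_r(u, R):
    # np.argpartition leaves the indices of the R largest values of u in
    # the last R positions, in no particular order (assumes R <= len(u)).
    indices = np.argpartition(u, -R)[-R:]
    return indices, u[indices]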

Remove unused variables

There are some variables that are defined but never used. Particular examples include RAND_MAX, file_summary, and totoalResidual.

Also, I believe totoalResidual is misspelled. But even its correctly-spelled version is never used, so the variable should either be used somewhere or deleted entirely.

Running and debugging the Python Code

@LindberghLi
For running and debugging the Python code, I need the following information to verify the functionality of the program; after that, we can work on benchmarking this program against the C++ version:

  1. The input file (you already gave me this)
  2. The desired output file for that specific input (for checking whether the program works properly)
  3. Other parameter values, like epsilon, T, D, M, P, the percentage of non-zero elements, etc.

Please provide this info as soon as you can.

Improve documentation and style

Let's use good documentation practices and coding style from the start. In particular,

  • Rename code1.py to be something representative of what the script is doing.
  • Provide comments in the code describing the operations.

rand_vct function

@magsol

Based on today's discussion with Xiang, it seems that our rand_vct function could be easily translated to a one-line instruction in Python. Today I found that the only reason Xiang used RAND_MAX is to normalize the random number to (-1, 1), RAND_MAX being the maximum possible random value. However, in Python we have a random generator that gives us a number between 0 and 1, so we don't need RAND_MAX. So, do you have any idea how to change "stat_randVCT"?
thanks.
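A minimal sketch of what that one-liner could look like (the function name mirrors the C++ stat_randVCT; whether the result also needs the 0-mean / unit-length treatment from the pseudocode is left to the caller):

import numpy as np

def rand_vct(T):
    # Uniform samples in (-1, 1); no RAND_MAX scaling needed in Python.
    return np.random.uniform(-1.0, 1.0, T)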


Configure SublimeLinter plugin with PEP exceptions

"pep8": {
                "@disable": false,
                "args": [],
                "excludes": [],
                "ignore": "E302,E251,E501,E701,E128,W391,E265",
                "max-line-length": null,
                "select": ""

This goes in the Preferences -> Package Settings -> SublimeLinter -> User Settings file. This will eliminate some warnings from your IDE.
