CrossCompare

Cross-compare script with a rewritten algorithm that can run in more than one process.

What is CrossCompare?

CrossCompare is a Python script originally written to empirically search datasets for optimally orthogonal submatrices. Data is read in from .csv files and can be processed in multiple processes using Python's multiprocessing module. For large datasets or submatrix dimensions greater than 2, it is advisable to run this algorithm on a supercomputing cluster.
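To get a feel for why larger dimensions get expensive, here is a rough count of candidate submatrices. It assumes the search considers every choice of d rows and d columns, which is an assumption about the algorithm rather than something taken from the script. The function name `count_submatrices` is made up for this illustration.

```python
from math import comb


def count_submatrices(n_rows: int, n_cols: int, dimension: int) -> int:
    """Number of ways to pick `dimension` rows and `dimension` columns.

    Assumption: every row/column selection counts as one candidate.
    """
    return comb(n_rows, dimension) * comb(n_cols, dimension)


# A 50 x 50 table, searched in 2 and then 3 dimensions.
for d in (2, 3):
    print(d, count_submatrices(50, 50, d))
# 2 1500625
# 3 384160000
```

Under that assumption, going from 2 to 3 dimensions on a 50 x 50 table multiplies the number of candidates by 256.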

Dependencies

The following must be installed in order for the code to run properly:

For easy installation, plus many more useful Python modules, try the Anaconda installer.

Running the script

Formatting the input:
Data must be formatted as a .csv file with row 1 as column headings and column 1 as row indices. Each entry in the table must be a number or may be blank. Strings will break the code. Place substrates in the columns of the table and, as in the example, enzymes in the rows. A small table might look like this:

            Substrate 1   Substrate 2   Substrate 3   ...   Substrate n
Enzyme 1    3.4           5.7           10            ...   ...
Enzyme 2    67            8.9           11            ...   ...
Enzyme 3    4.8           10E10         5             ...   ...
...         ...           ...           ...           ...   ...
Enzyme m    ...           ...           ...           ...   value for substrate n in enzyme m
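Since strings will break the code, it can help to check a file before a long run. The sketch below is not part of CrossCompare; `find_bad_cells` is a hypothetical helper that uses only the standard library and flags any data cell that is neither blank nor parseable as a number.

```python
import csv
import io


def find_bad_cells(csv_text: str) -> list[tuple[int, int, str]]:
    """Return (row, column, value) for each data cell that is not blank or numeric."""
    bad = []
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip row 1 (column headings)
    for row_number, row in enumerate(reader, start=2):
        # Column 1 holds the row index, so data starts at column 2.
        for col_number, cell in enumerate(row[1:], start=2):
            value = cell.strip()
            if value == "":
                continue  # blank entries are allowed
            try:
                float(value)  # accepts forms like 3.4, 67 and 10E10
            except ValueError:
                bad.append((row_number, col_number, value))
    return bad


sample = """,Substrate 1,Substrate 2,Substrate 3
Enzyme 1,3.4,5.7,10
Enzyme 2,67,,11
Enzyme 3,4.8,10E10,n/a
"""
print(find_bad_cells(sample))
# [(4, 4, 'n/a')]
```

Note that `float()` also accepts text such as `nan` and `inf`; whether the script handles those values is not something this check can tell you.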

Running your .csv:
After forking, cloning, or downloading the repository, navigate to the folder containing the source files using the terminal.

On mac/linux: cd path/to/CrossCompare

A basic run needs only the input and output filenames, given with the -i and -o flags respectively.

On mac/linux: python3 run_OSF.py -i "name of your.csv" -o "name of your desired output.csv"

The script will print what it is doing into the terminal and create a .csv in the current path with the specified file name.

To keep track of your runs, and to minimize typing in the terminal, I also added JSON parsing functionality. Instead of all those pesky commands and flags, just specify the path to your config file that contains all the necessary information.

python3 run_OSF.py -c config.json

See the config.json file for the expected format. To test whether things are working, run the included config file as-is; it should process the sample data.

Other options for customizing the script output:
[-d DIMENSION] Optional flag to specify the number of dimensions to use in the search. The default is 2. Using more than 2 dimensions on large (>500 entry) datasets can take a considerable amount of time.

[-p PROCESSES] Number of child processes to split the iteration into. If a personal computer is being used, putting the number of available processors here will give the speediest result. The default is 1.

[-l LENGTH] Length of the output list. Putting -l 2000 will give the top 2000 ranked submatrices. Default: 1000.

[-t THRESHOLD] RMS threshold to keep for the sorted list. Specifying this tag is often necessary for larger searches (>2 dimensions or larger datasets) to reduce memory usage. Setting this to 0.15 reduced our memory usage adequately. Default: 1.

Optimizing Memory Use: [-b BUFFER_LENGTH] Only important with large datasets. Use to adjust the length of the list that is split among the processors. If using many processors, increase this number. If low on RAM, decrease this number. A value of 1,000,000 works well with 16 processors. Default: 1000000.
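As a rough picture of what a buffer length controls, the sketch below pulls candidates from a lazy iterator one buffer at a time and splits each buffer into chunks, one per process. This is an illustration under assumptions, not the script's implementation; `buffered_chunks` and its parameters are hypothetical names.

```python
from itertools import combinations, islice


def buffered_chunks(candidates, buffer_length, processes):
    """Yield, per buffer, a list of up to `processes` chunks of candidates."""
    iterator = iter(candidates)
    while True:
        # Only `buffer_length` candidates are held in memory at once.
        buffer = list(islice(iterator, buffer_length))
        if not buffer:
            return
        # Ceiling division so every candidate lands in some chunk.
        size = -(-len(buffer) // processes)
        yield [buffer[i:i + size] for i in range(0, len(buffer), size)]


pairs = combinations(range(5), 2)  # 10 candidate row pairs, generated lazily
for chunks in buffered_chunks(pairs, buffer_length=4, processes=2):
    print([len(chunk) for chunk in chunks])
# [2, 2]
# [2, 2]
# [1, 1]
```

A larger buffer gives each process more work per round, at the cost of holding more candidates in memory at once.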

04/2018: It looks like the BUFFER_LENGTH parameter still does not solve the memory issue, because results pile up in a single process as the script runs; with enough results, things will crash. I'm still working on a solution to this, but in the meantime we are using the subsample.py script to take subsamples of our matrices and make the computation a bit shorter.
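To illustrate how a length cap and an RMS threshold together limit how many results are held at once, here is a generic bounded-heap sketch. It is not the script's code and is not a fix for the issue above. It assumes that a lower RMS ranks higher, which is an assumption, and `keep_best` is a hypothetical name.

```python
import heapq


def keep_best(scored, length, threshold):
    """Keep at most `length` (rms, item) pairs with rms <= threshold, lowest rms first.

    Assumption: a lower rms means a better-ranked result.
    """
    heap = []  # max-heap on rms via negation; the worst kept result is heap[0]
    for index, (rms, item) in enumerate(scored):
        if rms > threshold:
            continue  # dropped immediately, never stored
        entry = (-rms, index, item)  # index breaks ties so items are never compared
        if len(heap) < length:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            heapq.heapreplace(heap, entry)  # evict the current worst result
    return sorted(((-neg, item) for neg, _, item in heap), key=lambda pair: pair[0])


scored = [(0.30, "a"), (0.05, "b"), (0.12, "c"), (0.20, "d"), (0.08, "e")]
print(keep_best(scored, length=2, threshold=0.15))
# [(0.05, 'b'), (0.08, 'e')]
```

With this approach the stored list never grows past `length` entries, however many candidates are scored.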

Example input:
For a 3-dimensional run on a file called dataset.csv, using a computer with 8 cores, one may use:

python3 run_OSF.py -i "dataset.csv" -o "dataset_out.csv" -d 3 -t 0.1 -p 8
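To make the defaults concrete, here is an argparse sketch that mirrors the flags and defaults documented above and parses the example command. It is an illustration only: the real parser in run_OSF.py may be structured differently, and the `dest` names are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Mirror of the documented flags; attribute names are assumptions."""
    parser = argparse.ArgumentParser(prog="run_OSF.py")
    parser.add_argument("-i", dest="input")
    parser.add_argument("-o", dest="output")
    parser.add_argument("-c", dest="config")
    parser.add_argument("-d", dest="dimension", type=int, default=2)
    parser.add_argument("-p", dest="processes", type=int, default=1)
    parser.add_argument("-l", dest="length", type=int, default=1000)
    parser.add_argument("-t", dest="threshold", type=float, default=1.0)
    parser.add_argument("-b", dest="buffer_length", type=int, default=1000000)
    return parser


args = build_parser().parse_args(
    ["-i", "dataset.csv", "-o", "dataset_out.csv", "-d", "3", "-t", "0.1", "-p", "8"]
)
print(args.dimension, args.threshold, args.processes, args.length, args.buffer_length)
# 3 0.1 8 1000 1000000
```

The example command leaves -l and -b unset, so the run keeps the top 1000 results and uses a buffer length of 1000000.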
