DSE-512 Playground

About
Getting Started
- Prerequisites
Installing, Testing, Building
Running locally
Deployment
Continuous Ιntegration
Todo
Built With
License
Acknowledgments

About

A playground repo for the DSE-512 course.

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project on a live system.

Prerequisites

You need to have a machine with Python > 3.6 and any Bash based shell (e.g. zsh) installed.

$ python3.8 -V
Python 3.8.5

$ echo $SHELL
/usr/bin/zsh

You will also need to install MPI in your system. Ref for Ubuntu

Installing, Testing, Building

All the installation steps are being handled by the Makefile. The server=local flag basically specifies that you want to use conda instead of venv, and it can be changed easily in the lines #25-28. local is also the default flag, so you can omit it.

If you don't want to go through the detailed setup steps but finish the installation and run the tests quickly, execute the following command:

$ make install server=local

If you executed the previous command, you can skip through to the Running locally section.

Check the available make commands

$ make help
-----------------------------------------------------------------------------------------------------------
                                              DISPLAYING HELP                                              
-----------------------------------------------------------------------------------------------------------
Use make <make recipe> [server=<prod|circleci|local>] to specify the server
Prod, and local are using conda env, circleci uses virtualenv. Default: local

make help
       Display this message
make install [server=<prod|circleci|local>]
       Call clean delete_conda_env create_conda_env setup run_tests
make clean [server=<prod|circleci|local>]
       Delete all './build ./dist ./*.pyc ./*.tgz ./*.egg-info' files
make delete_env [server=<prod|circleci|local>]
       Delete the current conda env or virtualenv
make create_env [server=<prod|circleci|local>]
       Create a new conda env or virtualenv for the specified python version
make setup [server=<prod|circleci|local>]
       Call setup.py install
make run_tests [server=<prod|circleci|local>]
       Run all the tests from the specified folder
-----------------------------------------------------------------------------------------------------------

Clean any previous builds

$ make clean delete_env server=local

Create a new virtual environment

For creating a conda virtual environment run:

$ make create_env server=local

Build Locally (and install requirements)

To build the project locally using the setup.py install command (which also installs the requirements), execute the following command:

$ make setup server=local

Run the tests

The tests are located in the tests folder. To run all of them, execute the following command:

$ make run_tests server=local

Running the code locally

In order to run the code, you will only need to change the yml file if you need to, and either run its file directly or invoke its console script.

If you don't need to change yml file, skip to Execution Options.

Modifying the Configuration

There is an already configured yml file under confs/template_conf.yml with the following structure:

tag: template example_db: - config: hostname: example.host.name username: my_name password: !ENV ${PASS} db_name: my_db1 port: 3306 type: mysql

The !ENV flag indicates that you are passing an environmental value to this attribute. You can change the values/environmental var names as you wish. If a yaml variable name is changed/added/deleted, the corresponding changes should be reflected on the yml_schema.json too which validates it.

Set the required environment variables

In order to run the main.py you will need to set the environmental variables you are using in your configuration yml file. Example:

$ export PASS=my_password

The best way to do that, is to create a .env file (example), and source it before running the code.

Execution Options

First, make sure you are in the correct virtual environment:

$ conda activate dse512_playground $ which python /home/drkostas/anaconda3/envs/dse512_playground/bin/python

DSE-playground Main

Now, in order to run the code you can either call the main.py directly, or invoke the playground_main console script.

$ python playground/main.py --help usage: main.py -c CONFIG_FILE [-m {run_mode_1,run_mode_2,run_mode_3}] [-l LOG] [-d] [-h] A template for python projects. Required Arguments: -c CONFIG_FILE, --config-file CONFIG_FILE The configuration yml file Optional Arguments: -m {run_mode_1,run_mode_2,run_mode_3}, --run-mode {run_mode_1,run_mode_2,run_mode_3} Description of the run modes -l LOG, --log LOG Name of the output log file -d, --debug Enables the debug log messages -h, --help Show this help message and exit # Or $ playground_main --help usage: main.py -c CONFIG_FILE [-m {run_mode_1,run_mode_2,run_mode_3}] [-l LOG] [-d] [-h] A template for python projects. Required Arguments: -c CONFIG_FILE, --config-file CONFIG_FILE The configuration yml file Optional Arguments: -m {run_mode_1,run_mode_2,run_mode_3}, --run-mode {run_mode_1,run_mode_2,run_mode_3} Description of the run modes -l LOG, --log LOG Name of the output log file -d, --debug Enables the debug log messages -h, --help Show this help message and exit

DSE-playground CLI

There is also a cli.py which you can also invoke it by its console script too (cli).

$ cli --help Usage: cli [OPTIONS] COMMAND [ARGS]... Options: --install-completion [bash|zsh|fish|powershell|pwsh] Install completion for the specified shell. --show-completion [bash|zsh|fish|powershell|pwsh] Show completion for the specified shell, to copy it or customize the installation. --help Show this message and exit. Commands: bye hello

Deployment

The deployment is being done to Heroku. For more information you can check the setup guide.

Make sure you check the defined Procfile (reference) and that you set the above-mentioned environmental variables (reference) .

Continuous Integration

$ which python /home/drkostas/anaconda3/envs/DSE512-playground/bin/python (DSE512-playground)

### Execution Options <a name = "execution_options"></a> Depending on the file you want to run, you'll need to follow the corresponding instructions. To view them, just run: ```bash $ python <your file name>.py --help usage: <your file name>.py -m {run_mode_1,run_mode_2,run_mode_3} -c CONFIG_FILE [-l LOG] [-d] [-h] <Your python file\'s description. required arguments: -m {run_mode_1,run_mode_2,run_mode_3}, --run-mode {run_mode_1,run_mode_2,run_mode_3} Description of the run modes -c CONFIG_FILE, --config-file CONFIG_FILE The configuration yml file -l LOG, --log LOG Name of the output log file optional arguments: -d, --debug enables the debug log messages

To run it following the instructions.

Assignment03

For this assignment, you will extend the code we created in class, located at /lustre/haven/proj/UTK0150/jhinkl13/kmeans.
The submission you return to us should be a brief report formatted in HTML, DOCX, or PDF.
For the report that you submit, you do not need to overly format it; you can simply list your responses
to each of the problems below.

Please do the following using number of clusters -k 4, using the TCGA dataset:

(25 points) Starting from the kmeans repository developed in class, which you extended in the last
two assignments, refactor the kmeans.py to contain the following subfunctions: compute_distances(),
expectation_step(), maximization_step(), called at the appropriate places inside the kmeans()
function. Ensure that the program still runs. Do this for kmeans_vectorized.py as well. In your
report, indicate you’ve completed problem 1 and provide the path to your code on ISAAC (or github if
you choose to use it).
(25 points) Profile your newly-refactored kmeans.py and report the time spent in each of the
three new functions both in seconds and as a percentage of the total runtime. Do the same
for kmeans_vectorized.py. Recall that kmeans_vectorized.py attempted to speed up the
compute_distances() portion. Use Amdahl’s Law to compute the theoretical maximum speedup
possible by optimizing and parallelizing the compute_distances(). What percentage of that speedup
have we actually obtained by vectorization with numpy?
(25 points) Visualize an icicle plot of the profiling output for kmeans.py. You may use Snake- Vis or Viztracer along with cprofiler, as Todd demonstrated in Lecture 15. Do the same for
kmeans_vectorized.py. You may include these as screenshots in your report; please rescale the figures
to ensure that we can see the main function names in these plots.
(25 points) Using the profiling output from Problem 2, determine the maximum speedup you could
obtain by optimizing the expectation_step() and maximization_step() (note that you may need
to refactor further to measure the runtime of the main kmeans loop). In a new file, kmeans_numba.py,
use the @numba.jit decorator (install and import numba in your code), re-profiling your code, and
compare your runtime to the ideal speedup given by applying Amdahl’s Law. Report the new profiled
runtimes for these three functions and the total runtime using Numba in your report.
Full run for the tcga dataset
Measure times & Amdahl's law, create report, gather results, screenshots and submit.

drkostas / data-science-methods Goto Github PK

data-science-methods's Introduction

DSE-512 Playground

Table of Contents

About

Getting Started

Prerequisites

Installing, Testing, Building

Check the available make commands

Clean any previous builds

Create a new virtual environment

Build Locally (and install requirements)

Run the tests

Running the code locally

Modifying the Configuration

Set the required environment variables

Execution Options

DSE-playground Main

DSE-playground CLI

Deployment

Continuous Integration

data-science-methods's People

Contributors

Stargazers

Watchers

data-science-methods's Issues

Question 2

Question 1

Recommend Projects

Recommend Topics

Recommend Org