ucb-bar / cosa

A scheduler for spatial DNN accelerators that generates high-performance schedules in one shot using mixed integer programming (MIP).

License: BSD 2-Clause "Simplified" License

Shell 0.11% Python 65.43% C++ 34.35% C 0.03% Makefile 0.08%

cosa's Introduction

CoSA: Scheduling by Constrained Optimization for Spatial Accelerators

CoSA is a scheduler for spatial DNN accelerators that generates high-performance schedules in one shot using mixed integer programming (MIP). For more details, please refer to the CoSA paper (ISCA 2021).

CoSA leverages the regularities in DNN operators and hardware to formulate the DNN scheduling space into a MIP problem with algorithmic and architectural constraints, which can be solved to automatically generate a highly efficient schedule in one shot.
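
For intuition only, here is a minimal sketch of how one such scheduling decision could be posed as a MIP with gurobipy. The prime factors, memory levels, capacity limit, and cost weights below are illustrative assumptions, not CoSA's actual variables or objective.

# Illustrative-only sketch: encode "which memory level does each prime factor
# of a loop bound go to?" as binary variables in a gurobipy MIP. The factors,
# levels, capacity limit, and cost weights are made-up assumptions; CoSA's
# real formulation is much richer (tiling, temporal/spatial, permutation).
import gurobipy as gp
from gurobipy import GRB

factors = [2, 2, 7]                              # prime factors of a loop bound, e.g. P = 28
levels = ["Registers", "GlobalBuffer"]
cost = {"Registers": 1.0, "GlobalBuffer": 4.0}   # toy per-factor access cost

m = gp.Model("toy_scheduling")
# x[i, l] == 1 iff prime factor i is allocated to memory level l
x = m.addVars(len(factors), levels, vtype=GRB.BINARY, name="x")

# Each prime factor must be assigned to exactly one memory level.
m.addConstrs((x.sum(i, "*") == 1 for i in range(len(factors))), name="assign")

# Toy capacity constraint: at most two factors may be kept in the registers.
m.addConstr(gp.quicksum(x[i, "Registers"] for i in range(len(factors))) <= 2,
            name="reg_capacity")

# Toy objective: minimize the total weighted cost of the allocation.
m.setObjective(gp.quicksum(cost[l] * x[i, l]
                           for i in range(len(factors)) for l in levels),
               GRB.MINIMIZE)
m.optimize()

CoSA's real variables additionally capture whether each factor is mapped spatially or temporally and how the loops are permuted, with constraints derived from the architecture and the layer.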

Our transaction-based NoC simulator for evaluating various mappings can be found in our nocsim repository. It is developed in SystemC and Python and uses the HLS router design from Matchlib.

Installation

  1. Obtain a Gurobi license (see the Gurobi website for instructions on obtaining one for free if you are an academic). You do not need to download or install Gurobi itself. Once you have a license, download and extract the Gurobi license manager, then run the grbgetkey executable, supplying your license key when prompted (a quick license sanity check is sketched after the installation steps). If you choose a non-default location for the license file, specify its location with:
export GRB_LICENSE_FILE=/path/to/gurobi.lic
  2. Timeloop (optional; can be skipped if you only want to run the scheduler without Timeloop benchmarking): please refer to the instructions in the Timeloop Tutorial to install Timeloop with Docker. To install from source, please follow the instructions in the Timeloop GitHub repository. The specific Timeloop version used for the CoSA evaluation is commit 019f107.
  3. Download and install CoSA:
pip install cosa-scheduler

To install manually:

git clone https://github.com/ucb-bar/cosa.git 
python -m venv $HOME/.venv/cosa
source $HOME/.venv/cosa/bin/activate
python -m pip install -U pip
python -m pip install -e cosa

Alternatively, if using poetry:

poetry init
poetry add cosa-scheduler
# run this instead for git version
# poetry add git+https://github.com/ucb-bar/cosa.git#main
poetry shell
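
Before running CoSA, you can optionally confirm that the Gurobi license is visible to gurobipy (a minimal check, assuming gurobipy is installed in the active environment):

import gurobipy as gp

# Creating a model raises a licensing error if GRB_LICENSE_FILE does not
# point to a valid gurobi.lic, so this is a quick end-to-end license check.
try:
    gp.Model("license_check")
    print("Gurobi license OK")
except gp.GurobiError as err:
    print(f"Gurobi license problem: {err}")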

Run CoSA

To run the sample schedule, simply run cosa from the command line.

CoSA can be run with the following flags:

usage: cosa [-h] [-o OUTPUT_DIR] [-ap ARCH_PATH] [-mp MAPSPACE_PATH]
            [-pp PROB_PATH]

Run Configuration

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        Output Folder
  -ap ARCH_PATH, --arch_path ARCH_PATH
                        Hardware Architecture Path
  -mp MAPSPACE_PATH, --mapspace_path MAPSPACE_PATH
                        Mapspace Path
  -pp PROB_PATH, --prob_path PROB_PATH
                        Problem Dimension Path
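
As a concrete, hedged example, the snippet below drives the cosa entry point on the bundled Simba architecture and one AlexNet layer; the config paths assume a source checkout laid out as src/cosa/configs/... and may need adjusting for your installation.

import subprocess

# Hypothetical paths based on the repository layout; adjust to your checkout.
cmd = [
    "cosa",
    "-o", "output_dir",
    "-ap", "src/cosa/configs/arch/simba.yaml",
    "-mp", "src/cosa/configs/mapspace/mapspace.yaml",
    "-pp", "src/cosa/configs/workloads/alexnet_graph/_outputs_input.4.yaml",
]
subprocess.run(cmd, check=True)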

CoSA Inputs and Outputs

CoSA takes the problem dimensions, architecture constraints, and relation-encoding constants as inputs, and returns a mapping with tiling, temporal/spatial assignment, and permutation solved to optimize the user-defined objective.

def cosa(prob, arch, A, B, part_ratios, global_buf_idx, Z=None): 
    """Run CoSA to generate a mapping with tiling, temporal/spatial assignment, and permutation determined.
        We currently assume there is a global buffer.
    Args:
        prob: An object that defines the layer dimensions.
        arch: An object that defines the hardware architecture dimensions.
        A: A 2d binary constant matrix that encodes the layer-dimension-to-data-tensor relationship.
            1 means related, 0 means unrelated.
            Note that the R,S-to-input-tensor relation is handled specially in the formulation
            and is set to 0 here.
        B: A 2d binary constant matrix that encodes the data-tensor-to-memory-level mapping.
            It can be derived from the mapspace bypass pattern in Timeloop.
            Note that it is intended for an even mapping of the different data tensors to the different memory levels.
        part_ratios: A 2d array representing the partition ratios of the different data tensors in the different memory buffers.
        global_buf_idx: An index pointing to the global buffer.
        Z: Similar to B, but intended for an uneven mapping of the different data tensors to the different memory levels.
            It is a 3d binary constant matrix that encodes the data-tensor-to-memory-level mapping.

    Returns: 
        factor_config: A 2d array specifying the allocation decision for each prime factor.
        spatial_config: A 2d array specifying the temporal/spatial decisions for each prime factor.
        perm_config: A 2d array specifying the ordering of the R,S,P,Q,C,K,N factors at each memory level.
    """

cosa's People

Contributors

andful, gdinh, hqjenny, hqjennynv, vmiheer


cosa's Issues

Simba accelerator architecture.

Hi, the Simba accelerator architecture in the simba.yaml of your source code seems inconsistent with that in Table V of your paper.
Could you please explain this? Thanks!

Empty output folder

Dear researchers,
After following your instructions in the README file, I don't get any mapping YAML file. The output_dir folder contains multi-level sub-folders, but there are no files inside them.
The program prints "INFO:gurobipy.gurobipy:Optimal solution found" and displays no error message.

Timeloop command line and configuration used in the paper

Hi, I was wondering if you could post the timeloop-mapper YAML and command line you used for the paper. I installed Timeloop as suggested in the README and am using the Simba arch in the repository along with the ResNet workloads in configs. I am adding a mapper config file to explicitly set num-threads but leaving everything else at the defaults, yet the mapper is able to use only 1 processor in many cases.
I was wondering if you could help with this.

Does CoSA verify numerical results with software?

In the paper, CoSA claims to use 8-bit activations and weights. Are the quantization computation process and the results on the NoC cross-verified against software such as PyTorch's quantization mode? I did not see nocsim do real computation, only collect statistics. Am I correct?

Evaluation on GPU

Hi, in the "Evaluation on GPU" experiment, are the performance results evaluated by Timeloop or on a real NVIDIA K80 GPU?
If they are evaluated on a real NVIDIA K80 GPU, how were the experiments conducted?
Thanks!

How to calculate the storage capacity consistent with the paper from the arch file

From the simba.yaml file, I cannot find a way to calculate the sizes of the storage hierarchy consistent with the original paper. For example, the paper reads "Weight Buffer 32KB/PE" in Table V. Could you give me an equation to calculate such a size from the following parameters?

- name: WeightBuffer
  entries: 16384
  instances: 128
  meshX: 16
  word-bits: 8
  block-size: 8
  num-ports: 1
  num-banks: 8

Output File Not Generated

When I run the command

cosa -o /home/piyumal/cosa -ap /home/piyumal/cosa/cosa-main/src/cosa/configs/arch/simba.yaml -mp /home/piyumal/cosa/cosa-main/src/cosa/configs/mapspace/mapspace.yaml -pp /home/piyumal/cosa/cosa-main/src/cosa/configs/workloads/alexnet_graph/_outputs_input.4.yaml

The input files can be found here:
https://drive.google.com/drive/folders/1ftM5taTAAT_avwIPrEtAa2Svy4-qCzuK?usp=sharing

I get the following output in the terminal:

Set parameter Username
Academic license - for non-commercial use only - expires 2024-06-28
Gurobi Optimizer version 9.5.2 build v9.5.2rc0 (linux64)
Thread count: 2 physical cores, 4 logical processors, using up to 4 threads
Optimize a model with 839 rows, 1262 columns and 7781 nonzeros
Model fingerprint: 0x8e76fc88
Model has 1490 quadratic objective terms
Model has 4 quadratic constraints
Model has 1 general constraint
Variable types: 6 continuous, 1256 integer (1188 binary)
Coefficient statistics:
  Matrix range     [1e+00, 2e+00]
  QMatrix range    [1e+00, 2e+00]
  QLMatrix range   [1e+00, 7e+00]
  Objective range  [1e+00, 3e+01]
  QObjective range [3e-01, 5e+00]
  Bounds range     [1e+00, 1e+00]
  RHS range        [1e+00, 2e+01]
  QRHS range       [1e+01, 1e+01]
Presolve removed 650 rows and 152 columns
Presolve time: 0.01s
Presolved: 4643 rows, 2594 columns, 15781 nonzeros
Variable types: 2 continuous, 2592 integer (2591 binary)

Root relaxation: objective 2.241741e+02, 193 iterations, 0.00 seconds (0.00 work units)

    Nodes    |    Current Node    |     Objective Bounds      |     Work
 Expl Unexpl |  Obj  Depth IntInf | Incumbent    BestBd   Gap | It/Node Time

     0     0  224.17410    0    8          -  224.17410      -     -    0s
H    0     0                     234.9708346  224.17410  4.59%     -    0s
H    0     0                     232.9582244  224.17410  3.77%     -    0s
     0     0  224.28806    0   13  232.95822  224.28806  3.72%     -    0s
H    0     0                     230.3380479  224.28806  2.63%     -    0s
H    0     0                     227.3074593  224.28806  1.33%     -    0s
H    0     0                     225.5274472  224.28806  0.55%     -    0s
     0     0  224.32156    0   21  225.52745  224.32156  0.53%     -    0s
     0     0  224.33077    0   10  225.52745  224.33077  0.53%     -    0s
     0     0  224.33528    0   11  225.52745  224.33528  0.53%     -    0s
     0     0  224.37845    0   16  225.52745  224.37845  0.51%     -    0s
     0     0  224.66952    0    6  225.52745  224.66952  0.38%     -    0s
H    0     0                     224.7522820  224.66952  0.04%     -    0s
     0     0  224.70836    0    4  224.75228  224.70836  0.02%     -    0s
     0     0     cutoff    0       224.75228  224.75228  0.00%     -    0s

Explored 1 nodes (763 simplex iterations) in 0.16 seconds (0.13 work units)
Thread count was 4 (of 4 available processors)

Solution count 9: 224.752 224.983 224.983 ... 234.971

Optimal solution found (tolerance 1.00e-04)
Best objective 2.247522819817e+02, best bound 2.247522819817e+02, gap 0.0000%

Then, in the output folder I specified, I get the following folder chain, but it is empty at the end:
/cosa/output/simba/5_5_27_27_64_192_1_1_1_1_1/1_1_1_1_1_1_1-1_1_1_1_4_2_1-1_1_1_1_1_1_1-1_1_1_1_1_8_1-1_1_1_1_16_1_1-1_1_1_1_1_1_1+0123456-0123456-0123456-0123456-0123456-0123456

Any help finding the output mapping would be extremely appreciated and a great help indeed.

Thank You &
Warm Regards,
Piyumal
