alexmontgomerie / fpgaconvnet-optimiser

Optimiser for mapping convolutional neural network models to FPGA platforms.
Home Page: https://fpgaconvnet.com
License: GNU General Public License v3.0
The validation needs to support both single-FPGA and multi-FPGA distributions, so one way to do this is to add a switch to toggle between the two modes. The affected code is largely located in the models/network folder: keep the generic entry points named `check_`, rename the current single-platform implementations `platform_check_`, and name the multi-FPGA implementations `cluster_check_`. The general check would look like this:

```python
def check_ports():
    if cluster_nplatform:
        return cluster_check_ports()
    else:
        return platform_check_ports()
```
Similar to #70, it would be interesting to quote the percentage of time that DSPs are used during execution. This is a commonly quoted metric.
At the moment, the units for values in the input configuration files and in the reports are not given. This has led to some confusion. It would be best if we could supply the units to the optimiser, and if the units in reports were given, or at least added to the documentation.
The HLS model currently only supports a maximum of {64 bits/data width} streams. This should be reflected in the model as an explicit constraint in the partitioning process.
For the multi-fpga system, the optimiser needs to be aware that there is more than one platform to run on. Currently, all partitions are mapped to a single hardware platform.
Network level

I have tried feeding an ONNX model exported from PyTorch into fpgaconvnet-optimiser and got an index error stating that some of the inputs to the model are undefined. I've had a look at the ONNX git issues, and #2901 seemed to be the fix, but it is currently not in a released version of onnx, so I want to try adding their workarounds here for the time being. This means calling shape inference on the imported model before it is used, and possibly adding some information to the onnx graph.

Edit: it could be an issue with the different env I had to use, but my onnx install is unchanged, so I don't think this is the case.
In tools.parser, the parser has to infer the dimensions of each layer based on the input dimensions. This happens in the add_dimensions function. I'm not sure if parallel layers are currently supported for dimension inference. It might be worth checking whether there are ONNX tools that support this, possibly in the onnxruntime library.
Layers:

Tasks:

- Add the Greater op to the ExitConditionLayer in the same way as Conv and InnerProduct.

Linked to #13.
Proposal for how to structure the network level: Have a new JSON structure called a cluster. Each element in the structure has the 4 following fields:
{
"platform":"zedboard",
"id":1,
"out_connection":[2,3],
"in_connection":[0]
}
The code would then have to read the platform JSON for each entry in the cluster to find the specification for that platform. This means that for single platform code there are no changes to the representation JSON, and that most changes related to this part of the multiFPGA support will be additive.
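A minimal sketch of how the optimiser could resolve platform specifications from such a cluster description (the `load_platform` argument and the second platform name are illustrative assumptions; in practice the loader would `json.load` the platform file):

```python
def resolve_cluster(cluster, load_platform):
    """Map each cluster entry's id to the full specification of its platform.

    load_platform is a stand-in for reading the platform JSON, e.g.
    examples/platforms/<name>.json (path convention assumed)."""
    return {entry["id"]: load_platform(entry["platform"]) for entry in cluster}

# cluster description following the proposed structure above
cluster = [
    {"platform": "zedboard", "id": 0, "in_connection": [],  "out_connection": [1]},
    {"platform": "zcu104",   "id": 1, "in_connection": [0], "out_connection": []},
]

# stub loader for illustration; returns a dummy spec per platform
specs = resolve_cluster(cluster, lambda name: {"name": name})
print(specs[0]["name"])   # zedboard
```

Since a single-platform configuration is simply a cluster with one entry, the representation JSON for existing single-FPGA designs stays unchanged.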
For the ExitConditionLayer:

The Greater module also needs a control interface with some sort of stalling/handshake signal to drive the buffers and exit selections. This could be as simple as a valid signal.
Command

```shell
python -m run_optimiser --name vgg16 --model_path examples/models/vgg16-bn-7.onnx --platform_path examples/platforms/xcvu9p.json --output_path outputs/vgg16 --batch_size 256 --objective throughput --transforms fine weights_reloading coarse partition --optimiser simulated_annealing --optimiser_config_path examples/optimiser_example.yml
```
Error

```
Traceback (most recent call last):
  File "/home/benubu/miniconda3/envs/fpgaconvnet/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/benubu/miniconda3/envs/fpgaconvnet/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/benubu/phd/fpgaconvnet-optimiser/run_optimiser.py", line 96, in
    net.run_optimiser()
  File "/home/benubu/phd/fpgaconvnet-optimiser/optimiser/simulated_annealing.py", line 72, in run_optimiser
    self.apply_transform(transform)
  File "/home/benubu/phd/fpgaconvnet-optimiser/optimiser/optimiser.py", line 93, in apply_transform
    self.apply_random_partition(partition_index)
  File "/home/benubu/phd/fpgaconvnet-optimiser/transforms/partition.py", line 271, in apply_random_partition
    horizontal_merges = self.get_all_horizontal_merges(partition_index)
  File "/home/benubu/phd/fpgaconvnet-optimiser/transforms/partition.py", line 84, in get_all_horizontal_merges
    if self.graph.in_degree(output_node) > 1:
TypeError: '>' not supported between instances of 'InDegreeView' and 'int'
```
I added this before the erroring line:

```python
print(type(output_node))
print(output_node)
print(type(self.graph.in_degree(output_node)))
print(self.graph.in_degree(output_node))
```
result:

```
<class 'str'>
vgg0_relu12_fwd
<class 'int'>
1
<class 'str'>
vgg0_relu3_fwd
<class 'int'>
1
<class 'str'>
vgg0_relu0_fwd
<class 'int'>
1
<class 'str'>
vgg0_conv4_fwd
<class 'int'>
1
<class 'str'>
squeeze_vgg0_conv7_fwd
<class 'networkx.classes.reportviews.InDegreeView'>
[]
```
I looked up `in_degree()` and the docs say it should only return an int or a degree view, so I'm not really sure why it's returning a view for a single node here.
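A likely explanation (my assumption, but consistent with networkx's documented behaviour): `in_degree` returns an `int` only when its argument is a node of the graph. For anything else, the argument is treated as an nbunch (a collection of nodes), and a possibly empty `InDegreeView` is returned, which cannot be compared with an `int`. That would mean `squeeze_vgg0_conv7_fwd` has already been removed from the graph by the time the check runs. A minimal reproduction:

```python
import networkx as nx

G = nx.DiGraph()
G.add_edge("a", "b")

# existing node: in_degree returns a plain int
assert G.in_degree("b") == 1

# missing node: networkx falls back to treating the argument as an
# nbunch and returns a degree view, which iterates to an empty list
view = G.in_degree("missing_node")
assert list(view) == []

# this is exactly the failing comparison from the traceback:
try:
    view > 1
except TypeError as e:
    print(e)
```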
Both Zhewen and I have created different BRAM models for FIFOs, as well as for memory. We should merge these to get the most accurate model.
Spatially separable convolution (such as a 1x3 followed by a 3x1) is often used in low-rank approximation.

TODO: make kernel size, stride and padding 2D parameters.
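As an illustration of why this matters (a hedged numpy sketch, not the optimiser's implementation): any rank-1 KxK kernel factors into a Kx1 pass followed by a 1xK pass, cutting the multiplies per output pixel from K² to 2K:

```python
import numpy as np

def conv2d_valid(x, k):
    # naive 'valid' 2D cross-correlation, for illustration only
    kh, kw = k.shape
    H, W = x.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))
v = rng.standard_normal((3, 1))            # 3x1 vertical kernel
h = rng.standard_normal((1, 3))            # 1x3 horizontal kernel

full = conv2d_valid(x, v @ h)              # one 3x3 pass: 9 MACs per output
sep = conv2d_valid(conv2d_valid(x, v), h)  # 3x1 then 1x3: 6 MACs per output
assert np.allclose(full, sep)
```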
In the repo there are several variables representing frequency. The two most used are the clock frequency and the memory (and soon communication) bandwidth. Currently the frequency from the platform specification is left unused, while the memory bandwidth is imported. See lines 68 to 71 of fpgaconvnet-optimiser/fpgaconvnet_optimiser/models/network/update.py at ab20735.

Since the platform frequency is never imported, the design frequency falls back to the default value in the Network.py class. This should not be the case. The default is also given in MHz, rather than the Hz unit used in the platform specification JSON. The memory bandwidth is specified in Gb/s, which is inconsistent: it sits directly underneath the frequency entry, yet the two use different unit conventions.

My proposal is that all frequencies and bandwidths are quoted as raw numbers in fundamental units: all frequencies in Hz, all times in seconds, and so on. This would make converting between units more straightforward.
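A minimal sketch of the proposed convention (the set of unit strings is an assumption; the idea is simply to scale everything to base units once, on import):

```python
# Hedged sketch: convert quoted values to base units at the import boundary,
# so the rest of the optimiser only ever sees Hz, b/s, seconds, etc.
UNIT_SCALE = {
    "Hz": 1, "kHz": 1e3, "MHz": 1e6, "GHz": 1e9,   # frequencies -> Hz
    "b/s": 1, "Mb/s": 1e6, "Gb/s": 1e9,            # bandwidths  -> b/s
}

def to_base_units(value, unit):
    return value * UNIT_SCALE[unit]

print(to_base_units(156.25, "MHz"))   # 156250000.0
```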
Currently, convolutions containing multiple groups have to run sequentially.
The control edges for the branching networks are currently stored in a list of lists when being parsed. This will likely be confusing down the line, leading to errors.

Two options:

- convert `ctrledges` into a networkx graph

Command

```shell
python -m run_optimiser --name vgg16 --model_path examples/models/vgg16-bn-7.onnx --platform_path examples/platforms/xcvu9p.json --output_path outputs/vgg16 --batch_size 256 --objective throughput --transforms fine weights_reloading coarse partition --optimiser simulated_annealing --optimiser_config_path examples/optimiser_example.yml
```

Error
```
Traceback (most recent call last):
  File "/home/benubu/miniconda3/envs/fpgaconvnet/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/benubu/miniconda3/envs/fpgaconvnet/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/benubu/phd/fpgaconvnet-optimiser/run_optimiser.py", line 96, in
    net.run_optimiser()
  File "/home/benubu/phd/fpgaconvnet-optimiser/optimiser/simulated_annealing.py", line 73, in run_optimiser
    self.update_partitions()
  File "/home/benubu/phd/fpgaconvnet-optimiser/models/network/update.py", line 18, in update_partitions
    self.partitions[partition_index].remove_squeeze()
  File "/home/benubu/phd/fpgaconvnet-optimiser/models/partition/auxiliary.py", line 85, in remove_squeeze
    output_node = graphs.get_output_nodes(self.graph)[0]
IndexError: list index out of range
```
There needs to be some level of integration testing in order to make sure that the various projects work harmoniously. Therefore, unit tests need to be in place. We should have unit tests for the following components:

- `matrix` module
- `parser` module
- `graphs` module
- `transforms` module
- `models`: layers, modules
Updating the modules in models to reflect the changes in #7
I think this will be the easiest way to integrate inter-FPGA communication into the current architecture.
The communication layer would use a communication module modeled on the Aurora IP. The module should also be able to use other IPs for communication. The Aurora model will need to capture the following information at least:
The communication layer will need to support gathering streams into one transmission module. Likewise, it will need to be able to fork out the streams on the receiving end. There is no need to stream data in a particular order that I can think of, so a sliding window won't be necessary.
Proposed modules in the layer:

| Module | Description | Position |
|---|---|---|
| Communication | Sends/receives information | End/front of layer |
| Fork | Splits the data stream after a receive block or before a send block | Right before/after communication |
| Merge | Merges data streams | Right before/after communication |
In addition to the base layer parameters, we also need:

| Parameter | Description |
|---|---|
| Rate_in | The data rate of information coming from the FPGA network in a sending configuration; alternatively, the rate from the previous communication layer |
| Rate_out | The data rate to the next layer/FPGA, depending on the configuration of the layer |
| Send_nreceive | Boolean to keep track of whether the layer is sending information over the inter-FPGA link or receiving it |
| Pair_ID | This ID will have to be unique and shared between the two layers that are meant to be adjacent |
| Communication port | To help software with routing |
Would like to move `check_resources` to the `Optimiser` class for the sake of clarity, as it is only used in the `Optimiser` class, never by the `Network` class.
The scheduler workflow only supports writing partitions to a single FPGA, in order. When using multiple FPGA platforms in a cluster, the current scheduler won't have any way of knowing which platform to write to.

Multiple FPGAs open up several new scheduling possibilities. Is it better to update them sequentially? In bulk? This might be interesting to investigate for the project. For now, adding support on the optimiser side should be sufficient to get started. With the three fields below, the HLS side of fpgaConvNet should have enough information to know which FPGA deploys which partition, and when.

- `platform ID`: tells the scheduler which platform to update with the partition in question
- `partition_group`: tells the scheduler whether other platforms are meant to be updated together with this platform
- `scheduling_mode`: selects between individual or grouped reconfiguration
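As a sketch, each partition entry in the optimiser output might then carry these fields (the field names and values here are illustrative assumptions, not a fixed schema):

```json
{
  "partition_id": 4,
  "platform_id": 1,
  "partition_group": 2,
  "scheduling_mode": "grouped"
}
```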
I think the environment.yml is missing the pydot and graphviz packages.
Right now we have the estimated number of BRAMs that we expect the hardware to use, but what would be an interesting metric would be the BRAM utilisation efficiency. And what I mean by this is how much of the memory space of the BRAMs is used. For example, if a stream has a depth of 1000 and uses 1 BRAM with depth of 2000, then we would have 50% BRAM utilisation efficiency.
In my opinion this would be an interesting metric to report, and compare to other designs.
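The metric itself is simple to compute (a sketch; the BRAM depth of 2000 words is just the illustrative figure from the example above, since real depths depend on the width configuration):

```python
import math

def bram_utilisation_efficiency(stream_depth, bram_depth=2000):
    """Fraction of allocated BRAM words actually used by a stream.
    bram_depth is an illustrative assumption, not a device constant."""
    brams_used = math.ceil(stream_depth / bram_depth)
    return stream_depth / (brams_used * bram_depth)

print(bram_utilisation_efficiency(1000))   # 0.5, matching the example above
```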
From the optimiser side, we need a model that describes the bias module. This needs to be created and then added into the `Convolution` and `InnerProduct` layer models.
To support networks which are a bit more parallel, the `Layer` template class needs to support multiple input and output ports, and the layers derived from it must support this too. To do this, we need the following:

- `ports_in` and `ports_out` member variables
- `coarse_in` and `coarse_out` indexable by `port_index`
- `streams_in` and `streams_out` indexable by `port_index`
- the featuremap dimensions (`rows_in/out`, `cols_in/out`, `channels_in/out`) indexable by `port_index`
- `rate_in` and `rate_out` indexable by `port_index`
- `workload_in` and `workload_out` indexable by `port_index`
- `size_in` and `size_out` indexable by `port_index`
- the performance metrics (`latency`, `pipeline_depth` and `wait_depth`) indexable by `port_index`

The following layers need to be updated:

- ConvolutionLayer
- PoolingLayer
- InnerProductLayer
- ReLULayer
- ConcatLayer
- EltwiseLayer
- SplitLayer
- SqueezeLayer
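A hypothetical sketch of what the multi-port template could look like (the attribute and method names come from the list above, but the constructor shape and per-port list storage are assumptions, not the repo's actual API):

```python
class Layer:
    def __init__(self, rows, cols, channels, coarse_in, coarse_out):
        self.rows       = rows          # featuremap rows, one entry per input port
        self.cols       = cols          # featuremap cols, one entry per input port
        self.channels   = channels      # channels, one entry per input port
        self.coarse_in  = coarse_in     # coarse folding factor per input port
        self.coarse_out = coarse_out    # coarse folding factor per output port

    @property
    def ports_in(self):
        return len(self.coarse_in)

    @property
    def ports_out(self):
        return len(self.coarse_out)

    def rows_in(self, port_index=0):
        return self.rows[port_index]

    def streams_in(self, port_index=0):
        # parallel streams on an input port = its coarse folding factor
        return self.coarse_in[port_index]

    def workload_in(self, port_index=0):
        # words consumed on this port per inference
        return self.rows[port_index] * self.cols[port_index] * self.channels[port_index]

# e.g. a two-input layer (such as an EltwiseLayer) with one output port
layer = Layer(rows=[224, 224], cols=[224, 224], channels=[3, 3],
              coarse_in=[1, 2], coarse_out=[4])
print(layer.ports_in, layer.streams_in(1))   # 2 2
```

Defaulting `port_index` to 0 keeps the existing single-port layers working unchanged while the derived classes are migrated.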
I've found that some Conv ops seem to be missing the 'bias' input. I think the parser currently expects both weights and biases, so this may break it. The ONNX model is correct against the PyTorch implementation, so I'm assuming this isn't an issue with PyTorch or the ONNX export.

Once I've worked out which transforms need to change and which need to be added, I will split up this issue.
Networks such as ResNet rely on an Eltwise layer. This needs to be implemented.
The communication layers are always located at the end or start of the partition and should point to the next partition following the links in the cluster definition.
In an effort to make this project more user-friendly, documentation should be in place. We will use pdoc3 to generate the documentation, and docstrings will follow the numpy format.
- BufferLayer
- SplitLayer
- ExitConditionLayer
- ExitSelectionLayer
To support parallel blocks, we need to add a layer which can split a stream into several outputs. This keeps layers such as `PoolingLayer` and `ConvolutionLayer` simple, as they only need to be designed for a single port in and out.
We may want certain layers to stay in the same partition, which can help the optimiser make wise decisions. One possible way is to introduce the concept of a block of layers; the other is to add flags in the partition to allow/forbid the merge/split. It is also important to think about how to pass such constraints to the optimiser.
A Concatenation layer is needed in order to implement networks such as GoogLeNet. Look at https://github.com/AlexMontgomerie/fpgaConvNet2 for the legacy implementation.
The resource model coefficients are currently part of the repo (`fpgaconvnet_optimiser/coefficients`); however, keeping them in the repo leads to large pull requests, as they change quite frequently as the resource models change. It would be good to host these outside of the repo, with an easy way of downloading them. Once the resource models are changing less, we can move them back into the repo.
Currently the rates matrix shows the rate of each layer independently of the others. In reality, the rate of earlier layers affects later layers and vice versa. A method needs to be created that balances the rates such that the output rate of a layer matches the input rate of the next. This will be useful in getting more accurate performance estimates.
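As a hedged sketch of one possible balancing scheme (an illustration, not the repo's algorithm): in steady state, layer i+1 must consume exactly what layer i produces, so each layer is assigned a relative slowdown, chained along the pipeline and normalised so no layer exceeds full speed:

```python
def balance_rates(rate_in, rate_out):
    """rate_in[i]/rate_out[i]: words/cycle layer i consumes/produces at
    full speed, measured in isolation (hypothetical interface).
    Returns per-layer slowdowns s such that
    s[i] * rate_out[i] == s[i+1] * rate_in[i+1], with max(s) == 1."""
    s = [1.0]
    for i in range(len(rate_in) - 1):
        # match the interface between layer i and layer i+1
        s.append(s[-1] * rate_out[i] / rate_in[i + 1])
    peak = max(s)                 # keep every layer at or below full speed
    return [x / peak for x in s]

# a fast producer feeding a slower consumer gets throttled to half speed
print(balance_rates([1.0, 1.0], [2.0, 1.0]))   # [0.5, 1.0]
```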
For the `ExitSelectionLayer` it would be useful to have a select/mux-type module.
Adding explicit IDs to partitions allows for more clarity when matching partitions to platforms.
Is it the name? Say two nodes in the network have the same function and the same number of edges in and out? Or, for example, when looking for a vertical merge opportunity: how do we know that we are in fact looking at a vertical copy of the partition, and not another partition that happens to have the same output node and input?
The MatMul operation is generated by some PyTorch ONNX conversions instead of the GEMM operation. Add this to the parser.
One way to further exploit the parallelism of multiple FPGAs is to split the featuremap computation across different FPGA partitions. This can be introduced in the `tiling` transform.

Linked to #38.
@AlexMontgomerie Should the names of the layers be indexed using the platform they point to or should it be represented by a pair_id?
Indexed by platform they point to:
Currently the optimiser can produce a network description, however there is no method of loading an existing network description. This could be useful for design checkpointing and so on.
For the time being I'm going to work on parsing the graph, specifically the subgraphs, which I think are ignored at the moment.
Placeholders for the following ops will be put in for the time being.