alexmontgomerie / fpgaconvnet-optimiser

Optimiser for mapping convolutional neural network models to FPGA platforms.

Home Page: https://fpgaconvnet.com

License: GNU General Public License v3.0

Python 99.43% Shell 0.57%
accelerator deep-learning fpga optimization

fpgaconvnet-optimiser's Introduction

fpgaconvnet-optimiser's People

Contributors

alexmontgomerie, biggsbenjamin, ptoupas, yu-zhewen


fpgaconvnet-optimiser's Issues

Introducing a multiFPGA toggle

The validation needs to support both single-FPGA and multi-FPGA distributions, so one way to do this is to add a switch to toggle between the two modes. The areas of the code affected by this are largely located in the models/network folder:

  • validate.py
    • In validate.py there are a couple of ways to implement the switch. The cleanest would be to add a general check_* entry point, rename the current functions platform_check_*, and name the multi-FPGA implementations cluster_check_*. The general check would look like this:
def check_ports():
    if cluster_nplatform:
        return cluster_check_ports()
    else:
        return platform_check_ports()
  • transforms.py
    • New transforms will be added to support cluster splitting. A separate issue will be raised on this. They should be toggleable.
  • Optimiser Class
    • There will be some changes to the optimiser to clean it up a bit

DSP efficiency

Similar to #70, it would be interesting to quote the percentage of time that DSPs are used during execution. This is a commonly quoted metric.
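
A minimal sketch of how the metric could be computed; the inputs (operation count, DSP count, cycle count) are assumed to be available from the performance model and are not existing fpgaconvnet-optimiser attributes:

def dsp_efficiency(ops, n_dsp, cycles, ops_per_dsp_per_cycle=1):
    # fraction of the available DSP work actually performed during execution
    available = n_dsp * cycles * ops_per_dsp_per_cycle
    return ops / available if available else 0.0

# e.g. 1e9 MACs on 512 DSPs over 4e6 cycles -> roughly 49% DSP efficiency
print(f"{dsp_efficiency(1e9, 512, 4e6):.1%}")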

Units for configuration files and outputs

At the moment, the units for values in the input configuration files and in the reports are not given. This has led to some confusion. It would be best if the units could be supplied to the optimiser and stated in the reports, or at least added to the documentation.

Have multiple platforms

For the multi-FPGA system, the optimiser needs to be aware that there is more than one platform to run on. Currently, all partitions are mapped to a single hardware platform.

  • express multiple platforms at the Network level
  • express constraints on connectivity between platforms
  • reflect the bandwidth constraints on the platforms in the performance measures

ONNX Optimiser bug for Pytorch models

I have tried feeding an onnx model exported from pytorch into fpgaconvnet-optimiser and get an index error stating that some of the inputs to the model are undefined. I've had a look at the ONNX GitHub issues and 2901 seemed to be the fix, but it is currently not in a released version of onnx, so I want to try adding their workarounds here for the time being.

This means calling shape inference on the imported model before it is used and possibly adding some information to the onnx graph.

Edit: it could be an issue with the different environment I had to use, but my onnx install is unchanged, so I don't think that is the case.
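
A minimal sketch of the shape-inference workaround ("model.onnx" is an illustrative path, not a repo file):

import onnx
from onnx import shape_inference

# run shape inference on the exported PyTorch model before handing it to the
# parser, so the intermediate value_info entries are populated
model = onnx.load("model.onnx")
model = shape_inference.infer_shapes(model)
onnx.checker.check_model(model)
onnx.save(model, "model_inferred.onnx")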

Parallel layers dimension inference

In tools.parser the parser has to infer the dimensions of each layer based on the input dimensions; this happens in the add_dimensions function. I'm not sure whether parallel layers are currently supported for dimension inference. It might be worth checking whether there are onnx tools that support this, possibly in the onnxruntime library.

EE layers: finish layers

Layers:

  • ExitSelectionLayer
  • BufferLayer
  • ExitConditionLayer
  • SplitLayer

Tasks:

  • Link the constant input value from the Greater op to the ExitConditionLayer in the same way as for Conv and InnerProduct.

Cluster json for representing multiFPGA systems

Linked to #13
Proposal for how to structure the network level: have a new JSON structure called a cluster. Each element in the structure has the following four fields:

{
    "platform": "zedboard",
    "id": 1,
    "out_connection": [2, 3],
    "in_connection": [0]
}

The code would then read the platform JSON for each entry in the cluster to find the specification for that platform. This means that for single-platform code there are no changes to the representation JSON, and most of the changes related to this part of the multi-FPGA support will be additive.
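
A minimal sketch of resolving the cluster against per-platform specification files, assuming the cluster is a JSON list of entries like the one above and that platform specs live as <platform>.json in a platform directory (both are assumptions for illustration, not the current repo layout):

import json

def load_cluster(cluster_path, platform_dir):
    # resolve each cluster entry against its platform specification JSON
    with open(cluster_path) as f:
        cluster = json.load(f)
    for node in cluster:
        with open(f"{platform_dir}/{node['platform']}.json") as f:
            node["platform_spec"] = json.load(f)
    return cluster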

EE modules for ExitConditionLayer

For the ExitConditionLayer:

  • Softmax module
  • Reducemax - module that finds the maximum value of the tensor input
  • Greater - module that compares input value to some stored threshold constant

The Greater module also needs to have a control interface with some sort of stalling/handshake signal to drive the buffers and exit selections. Could be as simple as a valid signal.
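
A behavioural sketch of the three modules, which could double as a golden reference for testing; this is not the hardware implementation, and the single valid flag stands in for the proposed handshake signal:

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return e / np.sum(e)

def reducemax(x):
    return float(np.max(x))

def greater(value, threshold):
    # returns (exit_taken, valid); valid acts as the simple handshake signal
    return value > threshold, True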

partition.py - get_all_horizontal_merges - type mismatch in conditional - line 80

Command
python -m run_optimiser --name vgg16 --model_path examples/models/vgg16-bn-7.onnx --platform_path examples/platforms/xcvu9p.json --output_path outputs/vgg16 --batch_size 256 --objective throughput --transforms fine weights_reloading coarse partition --optimiser simulated_annealing --optimiser_config_path examples/optimiser_example.yml

Error
Traceback (most recent call last):
File "/home/benubu/miniconda3/envs/fpgaconvnet/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/benubu/miniconda3/envs/fpgaconvnet/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/benubu/phd/fpgaconvnet-optimiser/run_optimiser.py", line 96, in
net.run_optimiser()
File "/home/benubu/phd/fpgaconvnet-optimiser/optimiser/simulated_annealing.py", line 72, in run_optimiser
self.apply_transform(transform)
File "/home/benubu/phd/fpgaconvnet-optimiser/optimiser/optimiser.py", line 93, in apply_transform
self.apply_random_partition(partition_index)
File "/home/benubu/phd/fpgaconvnet-optimiser/transforms/partition.py", line 271, in apply_random_partition
horizontal_merges = self.get_all_horizontal_merges(partition_index)
File "/home/benubu/phd/fpgaconvnet-optimiser/transforms/partition.py", line 84, in get_all_horizontal_merges
if self.graph.in_degree(output_node) > 1:
TypeError: '>' not supported between instances of 'InDegreeView' and 'int'

I added this before the erroring line:

print(type(output_node))
print(output_node)
print(type(self.graph.in_degree(output_node)))
print(self.graph.in_degree(output_node))

result:
<class 'str'>
vgg0_relu12_fwd
<class 'int'>
1
<class 'str'>
vgg0_relu3_fwd
<class 'int'>
1
<class 'str'>
vgg0_relu0_fwd
<class 'int'>
1
<class 'str'>
vgg0_conv4_fwd
<class 'int'>
1
<class 'str'>
squeeze_vgg0_conv7_fwd
<class 'networkx.classes.reportviews.InDegreeView'>
[]

I looked up in_degree() and the documentation says it should return either an int or an iterator, so I'm not sure why it returns an InDegreeView here.
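
The behaviour can be reproduced with networkx directly: when the node passed to in_degree() is not in the graph, the argument is treated as an nbunch and an (empty) InDegreeView is returned instead of an int. Presumably squeeze_vgg0_conv7_fwd is no longer in the partition graph at this point. A minimal reproduction:

import networkx as nx

g = nx.DiGraph()
g.add_edge("a", "b")

print(g.in_degree("b"))        # 1  -> plain int, node is in the graph
print(g.in_degree("missing"))  # [] -> empty InDegreeView, node is NOT in the graph

# a guard such as `node in graph and graph.in_degree(node) > 1` avoids
# comparing an InDegreeView against an int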

Merge FIFO BRAM models

Both Zhewen and I have created different BRAM models for FIFOs, as well as for memory. We should merge these to get the most accurate model.

Spatial Separable Convolutions

Spatially separable convolutions (such as 1x3 and 3x1) are often used in low-rank approximation.
todo: make kernel size, stride and padding 2D parameters (see the sketch below)
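
A minimal sketch of the 2D-parameter handling, assuming either a scalar or a (rows, cols) pair is accepted; the helper name is illustrative:

def to_2d(param):
    # accept either a scalar or a (rows, cols) pair for kernel size,
    # stride and padding, so 1x3 / 3x1 convolutions can be expressed
    if isinstance(param, int):
        return (param, param)
    rows, cols = param
    return (int(rows), int(cols))

assert to_2d(3) == (3, 3)
assert to_2d((1, 3)) == (1, 3)   # spatially separable factor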

Standardize frequency specification

In the repo there are several variables representing frequency. The two most used are the clock frequency and the memory (and soon communication) bandwidth. Currently the frequency from the specification is left unused, while the memory bandwidth is imported. See these lines:

#self.platform['port_width'] = int(platform['port_width'])
#self.platform['freq'] = int(platform['freq'])
self.platform['reconf_time'] = float(platform['reconf_time'])
self.platform['mem_capacity'] = int(platform['mem_capacity'])

Since the platform frequency is not imported, the design frequency falls back to the default value of the Network.py class, which should not be the case. That default is also in MHz, rather than the Hz used in the platform specification JSON. The memory bandwidth is specified in Gb/s, which is inconsistent with the frequency specification sitting right above it.

My proposal is that all frequencies and bandwidths are quoted as raw numbers in base units: frequencies in Hz, times in seconds, and so on. This would make converting between units more straightforward.
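
A sketch of what the import could look like under that proposal; the mem_bandwidth field name is an assumption for illustration, not the current schema:

self.platform['freq'] = float(platform['freq'])                    # Hz
self.platform['mem_bandwidth'] = float(platform['mem_bandwidth'])  # bits/s (assumed field name)
self.platform['reconf_time'] = float(platform['reconf_time'])      # seconds
self.platform['mem_capacity'] = int(platform['mem_capacity'])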

EE parser: update ctrledges list to graph or class

The control edges for the branching networks are currently stored in a list of lists during parsing. This will likely become confusing down the line and lead to errors.

Options:

  1. Turn the ctrledges into a networkx graph (sketched below).
  2. Create a class for the Control Flow Graph - add functions that enable searching through nodes.
  3. Combine both of these somehow.
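
A minimal sketch of option 1, with illustrative node and attribute names:

import networkx as nx

ctrl_graph = nx.DiGraph()
ctrl_graph.add_edge("exit_condition_0", "buffer_0", signal="drop")
ctrl_graph.add_edge("exit_condition_0", "exit_selection", signal="take_exit")

# searching becomes a graph query rather than list indexing
drives = list(ctrl_graph.successors("exit_condition_0"))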

auxiliary.py - remove_squeeze - get_output_nodes seems to return an empty list

Command
python -m run_optimiser --name vgg16 --model_path examples/models/vgg16-bn-7.onnx --platform_path examples/platforms/xcvu9p.json --output_path outputs/vgg16 --batch_size 256 --objective throughput --transforms fine weights_reloading coarse partition --optimiser simulated_annealing --optimiser_config_path examples/optimiser_example.yml

Error

Traceback (most recent call last):
File "/home/benubu/miniconda3/envs/fpgaconvnet/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/benubu/miniconda3/envs/fpgaconvnet/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/benubu/phd/fpgaconvnet-optimiser/run_optimiser.py", line 96, in
net.run_optimiser()
File "/home/benubu/phd/fpgaconvnet-optimiser/optimiser/simulated_annealing.py", line 73, in run_optimiser
self.update_partitions()
File "/home/benubu/phd/fpgaconvnet-optimiser/models/network/update.py", line 18, in update_partitions
self.partitions[partition_index].remove_squeeze()
File "/home/benubu/phd/fpgaconvnet-optimiser/models/partition/auxiliary.py", line 85, in remove_squeeze
output_node = graphs.get_output_nodes(self.graph)[0]
IndexError: list index out of range
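
The root cause still needs investigating; a defensive sketch of a guard in remove_squeeze, assuming an empty result is a legitimate case to skip rather than an error:

output_nodes = graphs.get_output_nodes(self.graph)
if not output_nodes:
    return              # nothing to remove in this partition
output_node = output_nodes[0]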

Testing

There needs to be some level of integration testing to make sure that the various projects work together harmoniously; as a starting point, unit tests need to be in place. We should have unit tests for the following components (an example sketch follows the list):

  • matrix module
  • parser module
    • should check that we are able to parse the networks we wish to target
  • graphs module
    • ensure that all graph functions work as expected
  • transforms module
  • models
    • layers
    • modules
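
An illustrative unit test for the graphs module, runnable with pytest; the graph here is a stand-in rather than a parsed network:

import networkx as nx

def test_simple_pipeline_degrees():
    g = nx.DiGraph()
    g.add_edges_from([("conv0", "relu0"), ("relu0", "pool0")])
    assert g.in_degree("conv0") == 0
    assert g.out_degree("pool0") == 0
    assert list(nx.topological_sort(g)) == ["conv0", "relu0", "pool0"]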

Add inter-FPGA Communication layer

I think this will be the easiest way to integrate inter-FPGA communication into the current architecture.

The communication layer would use a communication module modelled on the Aurora IP, but the module should also be able to use other IPs for communication. At a minimum, the Aurora model will need to capture the information listed in the parameter tables below.

Communication Layer

The communication layer will need to support gathering streams into one transmission module. Likewise, it will need to be able to fork out the streams on the receiving end. There is no need to stream data in a particular order that I can think of, so a sliding window won't be necessary.
Proposed modules in layer:

Module        | Description                                                  | Position
Communication | Sends/receives information                                   | End/front of layer
Fork          | Splits data stream after receive block or before send block | Right before/after communication
Merge         | Merges data streams                                          | Right before/after communication

Parameters

In addition to the base layer parameters, we also need

Parameter     | Description
Rate_in       | The data rate of information coming from the FPGA network in a sending configuration, alternatively the rate from the previous communication layer
Rate_out      | The data rate to the next layer/FPGA, depending on the configuration of the layer
Send_nreceive | Boolean to keep track of whether the layer is sending information over the inter-FPGA link or receiving it
Pair_ID       | This ID has to be unique and shared between the two layers that are meant to be adjacent

Parameters for Communication Module

Parameter          | Description
Rate_in            | The data rate of information coming from the FPGA network in a sending configuration, alternatively the rate from the previous communication layer
Rate_out           | The data rate to the next layer/FPGA, depending on the configuration of the layer
Send_nreceive      | Boolean to keep track of whether the layer is sending information over the inter-FPGA link or receiving it
Pair_ID            | This ID has to be unique and shared between the two layers that are meant to be adjacent
Communication port | To help software with routing
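
A sketch of the proposed parameters as a Python dataclass; the field names follow the tables above, while the structure itself is an assumption:

from dataclasses import dataclass

@dataclass
class CommunicationModuleParams:
    rate_in: float        # data rate from the FPGA network / previous communication layer
    rate_out: float       # data rate to the next layer / FPGA
    send_nreceive: bool   # True when sending over the inter-FPGA link
    pair_id: int          # unique ID shared by the send/receive pair
    comm_port: int        # communication port, to help software with routing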

Add support for clusters in scheduler

The scheduler workflow only supports writing partitions to a single FPGA in order. When using multiple FPGA platforms in a cluster, the current scheduler won't have any way of knowing which platform to write to.

Multiple FPGAs open up several new possibilities for scheduling. Is it better to update them sequentially? In bulk? This might be interesting to investigate for the project. For now, adding support on the optimiser side should be sufficient to get started. Adding the three fields below should give the HLS side of FPGAConvNet enough information to know which partition each FPGA is meant to deploy, and when.

Proposal (sketched below):

platform_id: tells the scheduler which platform to update with the partition in question
partition_group: tells the scheduler whether other platforms are meant to be updated together with this platform
scheduling_mode: selects between individual and grouped reconfiguration
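
A sketch of how these fields might sit alongside a partition entry; the values, and anything beyond the three proposed field names, are illustrative rather than an existing schema:

partition_schedule = {
    "platform_id": 1,              # which platform to update with this partition
    "partition_group": 3,          # partitions/platforms to be updated together
    "scheduling_mode": "grouped",  # "individual" or "grouped" reconfiguration
}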

Report estimated BRAM utilisation efficiency

Right now we have the estimated number of BRAMs that we expect the hardware to use, but an interesting additional metric would be the BRAM utilisation efficiency: how much of the memory space of the allocated BRAMs is actually used. For example, if a stream has a depth of 1000 and uses 1 BRAM with a depth of 2000, then the BRAM utilisation efficiency is 50%.

In my opinion this would be an interesting metric to report, and compare to other designs.
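
A toy sketch of the metric; the default BRAM depth of 1024 words is an illustrative assumption, not the repo's memory model:

import math

def bram_utilisation_efficiency(stream_depth, bram_depth=1024, brams_used=None):
    # fraction of the allocated BRAM capacity that is actually used
    if brams_used is None:
        brams_used = math.ceil(stream_depth / bram_depth)
    return stream_depth / (brams_used * bram_depth)

# the example above: a depth-1000 stream on one depth-2000 BRAM -> 50%
print(f"{bram_utilisation_efficiency(1000, bram_depth=2000, brams_used=1):.0%}")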

Bias Module

From the optimiser side, we need a model that describes the bias module. This needs to be created then added into the Convolution and InnerProduct layer models.

Multi Input/Output ports for Layers

To support networks with more parallel structure, the Layer template class needs to support multiple input and output ports, and the layers derived from it must support this as well. To do this, we need the following (a sketch follows the list):

  • a ports_in and ports_out member variable
  • coarse_in and coarse_out indexable by port_index
  • streams_in and streams_out indexable by port_index
  • dimensions (rows_in/out, cols_in/out, channels_in/out) indexable by port_index
  • rate_in and rate_out indexable by port_index
  • workload_in and workload_out indexable by port_index
  • size_in and size_out indexable by port_index
  • performance metrics (latency, pipeline_depth and wait_depth) indexable by port_index
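
A minimal sketch of a port-indexed interface, assuming per-port dimensions are stored in lists; the names mirror the bullet points above, but the structure is an assumption rather than the current Layer template:

class MultiPortLayer:
    def __init__(self, rows_in, cols_in, channels_in, coarse_in,
                 rows_out, cols_out, channels_out, coarse_out):
        # one entry per port in each list
        self.ports_in, self.ports_out = len(rows_in), len(rows_out)
        self._rows_in, self._cols_in, self._channels_in = rows_in, cols_in, channels_in
        self._rows_out, self._cols_out, self._channels_out = rows_out, cols_out, channels_out
        self._coarse_in, self._coarse_out = coarse_in, coarse_out

    def workload_in(self, port_index=0):
        return (self._rows_in[port_index] * self._cols_in[port_index]
                * self._channels_in[port_index])

    def streams_in(self, port_index=0):
        return self._coarse_in[port_index]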

Layers to Update

The following layers need to be updated:

  • ConvolutionLayer
  • PoolingLayer
  • InnerProductLayer
  • ReLULayer
  • ConcatLayer
  • EltwiseLayer
  • SplitLayer
  • SqueezeLayer

Do Conv ops (ONNX) always have weights and biases?

I've found that some Conv ops seem to be missing the 'bias' input.
I think the parser currently expects both weights and biases, so this may break it.

The ONNX model is correct against the pytorch implementation, so I'm assuming this isn't an issue with pytorch or the onnx export.
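
In ONNX the Conv bias input B is optional (inputs are X, W and optionally B), so the parser should not assume it exists. A sketch of optional-bias handling, where initializers (a name-to-tensor dict) is an assumed helper rather than existing code:

def get_conv_weights_and_bias(node, initializers):
    # node: onnx NodeProto for a Conv op; initializers: {name: TensorProto}
    weights = initializers[node.input[1]]
    bias = initializers.get(node.input[2]) if len(node.input) > 2 else None
    return weights, bias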

Add Eltwise Layer

Networks such as ResNet rely on an Eltwise layer. This needs to be implemented.

  • Be able to support:
    • addition
    • subtraction
    • multiplication
  • functionality should be similar to pytorch.sum for the elementwise addition, for example

Module

  • Eltwise

Documentation

In an effort to make this project more user-friendly, documentation should be put in place. We will use pdoc3 to generate the documentation. Docstrings will follow the numpy format.

  • Models
    • Modules
    • Layers
    • Partitions
    • Network
  • transforms
    • coarse
    • fine
    • weights reloading
    • partition
  • tools
    • graphs
    • matrix
    • parser
  • optimiser

EE: Update visualiser for new layers

Layers:

  • BufferLayer
  • SplitLayer
  • ExitConditionLayer
  • ExitSelectionLayer

Additional files to fix:

  • Network.py
  • Partition.py
  • matrix.py

Tasks:

  • vis for full network
  • vis for subgraph partitions (not ONNX subgraphs)
  • Make the cluster naming for the partitions more useful
  • Add ctrl edges
  • Get rid of the random squeeze layer added on the end of the EC

Add SplitLayer

To support parallel blocks, we need to add a layer which can split a stream into several outputs. This keeps layers such as PoolingLayer and ConvolutionLayer simple, as they can be designed for a single port in and out.

  • create a layer that copies a single input stream to multiple output streams
  • support automatically placing and removing in the partition

Modules

  • Fork

Constraints on partition merge/split

We may want certain layers to stay in the same partition, which can help the optimiser make wise decisions. One possible way is to introduce the concept of a block of layers; another is to add flags to the partition to allow or forbid merges and splits. It is also important to think about how to pass such constraints to the optimiser.

separate resource model coefficients from the model

The resource model coefficients are currently part of the repo (fpgaconvnet_optimiser/coefficients); however, keeping them in the repo leads to large pull requests, as they change frequently along with the resource models. It would be good to host them outside of the repo and have an easy way of downloading them. Once the resource models change less often, we could move them back into the repo.

Balanced Rates Matrix

Currently the rates matrix shows the rate of each layer independently of the others. In reality, the rate of earlier layers affects later layers and vice versa. A method needs to be created that balances the rates so that the output rate of one layer matches the input rate of the next. This will be useful for more accurate performance estimation.
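
A minimal sketch of one way to do this for a simple chain of layers, assuming each layer exposes an unconstrained rate_in/rate_out (words per cycle) and scales both by the same factor; both assumptions are for illustration only:

def balance_rates(rate_in, rate_out):
    n = len(rate_in)
    # relative throughput of each layer, fixed up to a global scaling factor
    # by requiring matched rates at every layer-to-layer interface
    rel = [1.0] * n
    for i in range(1, n):
        rel[i] = rel[i - 1] * rate_out[i - 1] / rate_in[i]
    # largest global scaling that keeps every layer at or below its full rate
    global_scale = min(1.0 / r for r in rel)
    scale = [r * global_scale for r in rel]
    balanced_in = [r * s for r, s in zip(rate_in, scale)]
    balanced_out = [r * s for r, s in zip(rate_out, scale)]
    return balanced_in, balanced_out

# the bottleneck layer runs at full rate; every other layer is slowed to match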

EE modules: MUX

For the ExitSelectionLayer it would be useful to have a select/mux type module.

Adding ID to partitions

Adding explicit IDs to partitions allows for more clarity when matching partitions to platforms.

How is the uniqueness of nodes represented?

Is it the name? What if two nodes in the network have the same function and the same number of edges in and out? For example, when looking for a vertical merge opportunity, how do we know that we are in fact looking at a vertical copy of the partition and not another partition that has the same output node and input?

Add Tiling Transform

One way to further exploit the parallelism from multiple FPGAs is to split the featuremap computation across different FPGA partitions. This can be introduced in the tiling transform.

  • create a transform for a partition
  • update a tiling factor (number of tiles) for a given partition
  • update layers and modules within the partition such that the tiling factor reflects the dimensions of each module/layer
  • update the performance measures with this tiling factor

Naming communication layers

Linked to #38.

@AlexMontgomerie Should the names of the layers be indexed by the platform they point to, or should they be indexed by a pair_id?

Indexed by platform they point to:


Indexed by pair id:

Load Network Description

Currently the optimiser can produce a network description; however, there is no method of loading an existing network description. This could be useful for design checkpointing and so on.

Adding layer placeholders for ops and extending parser

For the time being I'm going to work on parsing the graph - specifically the subgraphs, which I think are ignored at the moment.
Placeholders for the following ops will be put in.

Operations:

  • ReduceMax
  • Greater
  • If (subgraph extraction)
  • Identity (this can probably be ignored as long as the split layer is functioning)
