alexmontgomerie / fpgaconvnet-optimiser

Optimiser for mapping convolutional neural network models to FPGA platforms.
Home Page: https://fpgaconvnet.com
License: GNU General Public License v3.0
The validation needs to support both single-FPGA and multi-FPGA distributions, so one way to do this is to add a switch to toggle between the two modes. The affected code is largely located in the models/network folder: keep the generic entry points named `check_`, rename the current single-platform implementations `platform_check_`, and name the multi-FPGA implementations `cluster_check_`. The general check would look like this:

```python
def check_ports():
    if cluster_nplatform:
        return cluster_check_ports()
    else:
        return platform_check_ports()
```
Similar to #70, it would be interesting to quote the percentage of time that DSPs are used during execution. This is a commonly quoted metric.
At the moment, the units for values in the input configuration files and in the reports are not given. This has led to some confusion. It would be best if we could supply the units to the optimiser, and if the units in reports were given, or at least added to the documentation.
The HLS model currently only supports a maximum of {64 bits/data width} streams. This should be reflected in the model as an explicit constraint in the partitioning process.
For the multi-fpga system, the optimiser needs to be aware that there is more than one platform to run on. Currently, all partitions are mapped to a single hardware platform.
Network level

I have tried feeding an ONNX model exported from PyTorch into fpgaconvnet-optimiser and got an index error stating that some of the inputs to the model are undefined. I've had a look at the ONNX git issues, and #2901 seemed to be the fix, but it is currently not in a released version of onnx, so I want to try adding their workarounds here for the time being. This means calling shape inference on the imported model before it is used, and possibly adding some information to the onnx graph.

Edit: it could be an issue with the different env I had to use, but my onnx install is unchanged, so I don't think this is the case.
In tools.parser, the parser has to infer the dimensions of each layer based on the input dimensions. This happens in the add_dimensions function. I'm not sure if parallel layers are currently supported for dimension inference. It might be worth checking whether there are ONNX tools that support this, possibly in the onnxruntime library.
Layers:

Tasks:

- Add the Greater op to the ExitConditionLayer in the same way as Conv and InnerProduct.

Linked to #13.
Proposal for how to structure the network level: Have a new JSON structure called a cluster. Each element in the structure has the 4 following fields:
{
"platform":"zedboard",
"id":1,
"out_connection":[2,3],
"in_connection":[0]
}
The code would then have to read the platform JSON for each entry in the cluster to find the specification for that platform. This means that for single platform code there are no changes to the representation JSON, and that most changes related to this part of the multiFPGA support will be additive.
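A minimal sketch of how the optimiser could resolve platform specifications from such a cluster description (the `load_platform` argument and the second platform name are illustrative assumptions; in practice the loader would `json.load` the platform file):

```python
def resolve_cluster(cluster, load_platform):
    """Map each cluster entry's id to the full specification of its platform.

    load_platform is a stand-in for reading the platform JSON, e.g.
    examples/platforms/<name>.json (path convention assumed)."""
    return {entry["id"]: load_platform(entry["platform"]) for entry in cluster}

# cluster description following the proposed structure above
cluster = [
    {"platform": "zedboard", "id": 0, "in_connection": [],  "out_connection": [1]},
    {"platform": "zcu104",   "id": 1, "in_connection": [0], "out_connection": []},
]

# stub loader for illustration; returns a dummy spec per platform
specs = resolve_cluster(cluster, lambda name: {"name": name})
print(specs[0]["name"])   # zedboard
```

Since a single-platform configuration is simply a cluster with one entry, the representation JSON for existing single-FPGA designs stays unchanged.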
For the ExitConditionLayer:

The Greater module also needs a control interface with some sort of stalling/handshake signal to drive the buffers and exit selections. This could be as simple as a valid signal.
Command

```shell
python -m run_optimiser --name vgg16 --model_path examples/models/vgg16-bn-7.onnx --platform_path examples/platforms/xcvu9p.json --output_path outputs/vgg16 --batch_size 256 --objective throughput --transforms fine weights_reloading coarse partition --optimiser simulated_annealing --optimiser_config_path examples/optimiser_example.yml
```
Error

```
Traceback (most recent call last):
  File "/home/benubu/miniconda3/envs/fpgaconvnet/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/benubu/miniconda3/envs/fpgaconvnet/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/benubu/phd/fpgaconvnet-optimiser/run_optimiser.py", line 96, in
    net.run_optimiser()
  File "/home/benubu/phd/fpgaconvnet-optimiser/optimiser/simulated_annealing.py", line 72, in run_optimiser
    self.apply_transform(transform)
  File "/home/benubu/phd/fpgaconvnet-optimiser/optimiser/optimiser.py", line 93, in apply_transform
    self.apply_random_partition(partition_index)
  File "/home/benubu/phd/fpgaconvnet-optimiser/transforms/partition.py", line 271, in apply_random_partition
    horizontal_merges = self.get_all_horizontal_merges(partition_index)
  File "/home/benubu/phd/fpgaconvnet-optimiser/transforms/partition.py", line 84, in get_all_horizontal_merges
    if self.graph.in_degree(output_node) > 1:
TypeError: '>' not supported between instances of 'InDegreeView' and 'int'
```
I added this before the erroring line:

```python
print(type(output_node))
print(output_node)
print(type(self.graph.in_degree(output_node)))
print(self.graph.in_degree(output_node))
```
result:

```
<class 'str'>
vgg0_relu12_fwd
<class 'int'>
1
<class 'str'>
vgg0_relu3_fwd
<class 'int'>
1
<class 'str'>
vgg0_relu0_fwd
<class 'int'>
1
<class 'str'>
vgg0_conv4_fwd
<class 'int'>
1
<class 'str'>
squeeze_vgg0_conv7_fwd
<class 'networkx.classes.reportviews.InDegreeView'>
[]
```
I looked up `in_degree()` and the docs say it should only return an int or a degree view, so I'm not really sure why it's returning a view for a single node here.
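A likely explanation (my assumption, but consistent with networkx's documented behaviour): `in_degree` returns an `int` only when its argument is a node of the graph. For anything else, the argument is treated as an nbunch (a collection of nodes), and a possibly empty `InDegreeView` is returned, which cannot be compared with an `int`. That would mean `squeeze_vgg0_conv7_fwd` has already been removed from the graph by the time the check runs. A minimal reproduction:

```python
import networkx as nx

G = nx.DiGraph()
G.add_edge("a", "b")

# existing node: in_degree returns a plain int
assert G.in_degree("b") == 1

# missing node: networkx falls back to treating the argument as an
# nbunch and returns a degree view, which iterates to an empty list
view = G.in_degree("missing_node")
assert list(view) == []

# this is exactly the failing comparison from the traceback:
try:
    view > 1
except TypeError as e:
    print(e)
```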
Both Zhewen and I have created different BRAM models for FIFOs, as well as for memory. We should merge these to get the most accurate model.
Spatially separable convolution (such as a 1x3 followed by a 3x1) is often used in low-rank approximation.

TODO: make kernel size, stride and padding 2D parameters.
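As an illustration of why this matters (a hedged numpy sketch, not the optimiser's implementation): any rank-1 KxK kernel factors into a Kx1 pass followed by a 1xK pass, cutting the multiplies per output pixel from K² to 2K:

```python
import numpy as np

def conv2d_valid(x, k):
    # naive 'valid' 2D cross-correlation, for illustration only
    kh, kw = k.shape
    H, W = x.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))
v = rng.standard_normal((3, 1))            # 3x1 vertical kernel
h = rng.standard_normal((1, 3))            # 1x3 horizontal kernel

full = conv2d_valid(x, v @ h)              # one 3x3 pass: 9 MACs per output
sep = conv2d_valid(conv2d_valid(x, v), h)  # 3x1 then 1x3: 6 MACs per output
assert np.allclose(full, sep)
```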
In the repo there are several variables representing frequency. The two most used are the clock frequency and the memory (and soon communication) bandwidth. Currently the frequency from the platform specification is left unused, while the memory bandwidth is imported. See lines 68 to 71 of fpgaconvnet-optimiser/fpgaconvnet_optimiser/models/network/update.py at ab20735.

Since the platform frequency is never imported, the design frequency falls back to the default value in the Network.py class. This should not be the case. The default is also given in MHz, rather than the Hz unit used in the platform specification JSON. The memory bandwidth is specified in Gb/s, which is inconsistent: it sits directly underneath the frequency entry, yet the two use different unit conventions.

My proposal is that all frequencies and bandwidths are quoted as raw numbers in fundamental units: all frequencies in Hz, all times in seconds, and so on. This would make converting between units more straightforward.
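A minimal sketch of the proposed convention (the set of unit strings is an assumption; the idea is simply to scale everything to base units once, on import):

```python
# Hedged sketch: convert quoted values to base units at the import boundary,
# so the rest of the optimiser only ever sees Hz, b/s, seconds, etc.
UNIT_SCALE = {
    "Hz": 1, "kHz": 1e3, "MHz": 1e6, "GHz": 1e9,   # frequencies -> Hz
    "b/s": 1, "Mb/s": 1e6, "Gb/s": 1e9,            # bandwidths  -> b/s
}

def to_base_units(value, unit):
    return value * UNIT_SCALE[unit]

print(to_base_units(156.25, "MHz"))   # 156250000.0
```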
Currently, convolutions containing multiple groups have to run sequentially.
The control edges for the branching networks are currently stored in a list of lists when being parsed. This will likely be confusing down the line, leading to errors.

Two options:

- convert `ctrledges` into a networkx graph

Command

```shell
python -m run_optimiser --name vgg16 --model_path examples/models/vgg16-bn-7.onnx --platform_path examples/platforms/xcvu9p.json --output_path outputs/vgg16 --batch_size 256 --objective throughput --transforms fine weights_reloading coarse partition --optimiser simulated_annealing --optimiser_config_path examples/optimiser_example.yml
```

Error
```
Traceback (most recent call last):
  File "/home/benubu/miniconda3/envs/fpgaconvnet/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/benubu/miniconda3/envs/fpgaconvnet/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/benubu/phd/fpgaconvnet-optimiser/run_optimiser.py", line 96, in
    net.run_optimiser()
  File "/home/benubu/phd/fpgaconvnet-optimiser/optimiser/simulated_annealing.py", line 73, in run_optimiser
    self.update_partitions()
  File "/home/benubu/phd/fpgaconvnet-optimiser/models/network/update.py", line 18, in update_partitions
    self.partitions[partition_index].remove_squeeze()
  File "/home/benubu/phd/fpgaconvnet-optimiser/models/partition/auxiliary.py", line 85, in remove_squeeze
    output_node = graphs.get_output_nodes(self.graph)[0]
IndexError: list index out of range
```
There needs to be some level of integration testing in order to make sure that the various projects work harmoniously. Therefore, unit tests need to be in place. We should have unit tests for the following components:

- `matrix` module
- `parser` module
- `graphs` module
- `transforms` module
- `models`: layers, modules
Updating the modules in models to reflect the changes in #7
I think this will be the easiest way to integrate inter-FPGA communication into the current architecture.
The communication layer would use a communication module modeled on the Aurora IP. The module should also be able to use other IPs for communication. The Aurora model will need to capture the following information at least:
The communication layer will need to support gathering streams into one transmission module. Likewise, it will need to be able to fork out the streams on the receiving end. There is no need to stream data in a particular order that I can think of, so a sliding window won't be necessary.
Proposed modules in the layer:

| Module | Description | Position |
|---|---|---|
| Communication | Sends/receives information | End/front of layer |
| Fork | Splits the data stream after a receive block or before a send block | Right before/after communication |
| Merge | Merges data streams | Right before/after communication |
In addition to the base layer parameters, we also need:

| Parameter | Description |
|---|---|
| Rate_in | The data rate of information coming from the FPGA network in a sending configuration; alternatively, the rate from the previous communication layer |
| Rate_out | The data rate to the next layer/FPGA, depending on the configuration of the layer |
| Send_nreceive | Boolean to keep track of whether the layer is sending information over the inter-FPGA link or receiving it |
| Pair_ID | This ID will have to be unique and shared between the two layers that are meant to be adjacent |
| Communication port | To help software with routing |
Would like to move `check_resources` to the `Optimiser` class for the sake of clarity, as it is only used in the `Optimiser` class, never by the `Network` class.
The scheduler workflow only supports writing partitions to a single FPGA, in order. When using multiple FPGA platforms in a cluster, the current scheduler won't have any way of knowing which platform to write to.

Multiple FPGAs open up several new scheduling possibilities. Is it better to update them sequentially? In bulk? This might be interesting to investigate for the project. For now, adding support on the optimiser side should be sufficient to get started. With the three fields below, the HLS side of fpgaConvNet should have enough information to know which FPGA deploys which partition, and when.

- `platform ID`: tells the scheduler which platform to update with the partition in question
- `partition_group`: tells the scheduler whether other platforms are meant to be updated together with this platform
- `scheduling_mode`: selects between individual or grouped reconfiguration
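As a sketch, each partition entry in the optimiser output might then carry these fields (the field names and values here are illustrative assumptions, not a fixed schema):

```json
{
  "partition_id": 4,
  "platform_id": 1,
  "partition_group": 2,
  "scheduling_mode": "grouped"
}
```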
I think the environment.yml is missing the pydot and graphviz packages.
Right now we have the estimated number of BRAMs that we expect the hardware to use, but what would be an interesting metric would be the BRAM utilisation efficiency. And what I mean by this is how much of the memory space of the BRAMs is used. For example, if a stream has a depth of 1000 and uses 1 BRAM with depth of 2000, then we would have 50% BRAM utilisation efficiency.
In my opinion this would be an interesting metric to report, and compare to other designs.
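The metric itself is simple to compute (a sketch; the BRAM depth of 2000 words is just the illustrative figure from the example above, since real depths depend on the width configuration):

```python
import math

def bram_utilisation_efficiency(stream_depth, bram_depth=2000):
    """Fraction of allocated BRAM words actually used by a stream.
    bram_depth is an illustrative assumption, not a device constant."""
    brams_used = math.ceil(stream_depth / bram_depth)
    return stream_depth / (brams_used * bram_depth)

print(bram_utilisation_efficiency(1000))   # 0.5, matching the example above
```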
From the optimiser side, we need a model that describes the bias module. This needs to be created and then added into the `Convolution` and `InnerProduct` layer models.
To support networks which are a bit more parallel, the `Layer` template class needs to support multiple input and output ports, and the layers derived from it must support this too. To do this, we need the following:

- `ports_in` and `ports_out` member variables
- `coarse_in` and `coarse_out` indexable by `port_index`
- `streams_in` and `streams_out` indexable by `port_index`
- the featuremap dimensions (`rows_in/out`, `cols_in/out`, `channels_in/out`) indexable by `port_index`
- `rate_in` and `rate_out` indexable by `port_index`
- `workload_in` and `workload_out` indexable by `port_index`
- `size_in` and `size_out` indexable by `port_index`
- the performance metrics (`latency`, `pipeline_depth` and `wait_depth`) indexable by `port_index`

The following layers need to be updated:

- ConvolutionLayer
- PoolingLayer
- InnerProductLayer
- ReLULayer
- ConcatLayer
- EltwiseLayer
- SplitLayer
- SqueezeLayer
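A hypothetical sketch of what the multi-port template could look like (the attribute and method names come from the list above, but the constructor shape and per-port list storage are assumptions, not the repo's actual API):

```python
class Layer:
    def __init__(self, rows, cols, channels, coarse_in, coarse_out):
        self.rows       = rows          # featuremap rows, one entry per input port
        self.cols       = cols          # featuremap cols, one entry per input port
        self.channels   = channels      # channels, one entry per input port
        self.coarse_in  = coarse_in     # coarse folding factor per input port
        self.coarse_out = coarse_out    # coarse folding factor per output port

    @property
    def ports_in(self):
        return len(self.coarse_in)

    @property
    def ports_out(self):
        return len(self.coarse_out)

    def rows_in(self, port_index=0):
        return self.rows[port_index]

    def streams_in(self, port_index=0):
        # parallel streams on an input port = its coarse folding factor
        return self.coarse_in[port_index]

    def workload_in(self, port_index=0):
        # words consumed on this port per inference
        return self.rows[port_index] * self.cols[port_index] * self.channels[port_index]

# e.g. a two-input layer (such as an EltwiseLayer) with one output port
layer = Layer(rows=[224, 224], cols=[224, 224], channels=[3, 3],
              coarse_in=[1, 2], coarse_out=[4])
print(layer.ports_in, layer.streams_in(1))   # 2 2
```

Defaulting `port_index` to 0 keeps the existing single-port layers working unchanged while the derived classes are migrated.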
I've found that some Conv ops seem to be missing the 'bias' input. I think the parser currently expects both weights and biases, so this may break it. The ONNX model is correct against the PyTorch implementation, so I'm assuming this isn't an issue with PyTorch or the ONNX export.

Once I've worked out which transforms need to change and which need to be added, I will split up this issue.
Networks such as ResNet rely on an Eltwise layer. This needs to be implemented.
The communication layers are always located at the end or start of the partition and should point to the next partition following the links in the cluster definition.
In an effort to make this project more user-friendly, documentation should be in place. We will use pdoc3 to generate the documentation, and docstrings will follow the numpy format.
- BufferLayer
- SplitLayer
- ExitConditionLayer
- ExitSelectionLayer
To support parallel blocks, we need to add a layer which can split a stream into several outputs. This keeps layers such as `PoolingLayer` and `ConvolutionLayer` simple, as they only need to be designed for a single port in and out.
We may want certain layers to stay in the same partition, which can help the optimiser make wise decisions. One possible way is to introduce the concept of a block of layers; the other is to add flags in the partition to allow/forbid the merge/split. It is also important to think about how to pass such constraints to the optimiser.
A Concatenation layer is needed in order to implement networks such as GoogLeNet. Look at https://github.com/AlexMontgomerie/fpgaConvNet2 for the legacy implementation.
The resource model coefficients are currently part of the repo (`fpgaconvnet_optimiser/coefficients`); however, keeping them in the repo leads to large pull requests, as they change quite frequently as the resource models change. It would be good to host these outside of the repo, with an easy way of downloading them. Once the resource models are changing less, we can move them back into the repo.
Currently the rates matrix shows the rate of each layer independently of the others. In reality, the rate of earlier layers affects later layers and vice versa. A method needs to be created that balances the rates such that the output rate of a layer matches the input rate of the next. This will be useful in getting more accurate performance estimates.
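As a hedged sketch of one possible balancing scheme (an illustration, not the repo's algorithm): in steady state, layer i+1 must consume exactly what layer i produces, so each layer is assigned a relative slowdown, chained along the pipeline and normalised so no layer exceeds full speed:

```python
def balance_rates(rate_in, rate_out):
    """rate_in[i]/rate_out[i]: words/cycle layer i consumes/produces at
    full speed, measured in isolation (hypothetical interface).
    Returns per-layer slowdowns s such that
    s[i] * rate_out[i] == s[i+1] * rate_in[i+1], with max(s) == 1."""
    s = [1.0]
    for i in range(len(rate_in) - 1):
        # match the interface between layer i and layer i+1
        s.append(s[-1] * rate_out[i] / rate_in[i + 1])
    peak = max(s)                 # keep every layer at or below full speed
    return [x / peak for x in s]

# a fast producer feeding a slower consumer gets throttled to half speed
print(balance_rates([1.0, 1.0], [2.0, 1.0]))   # [0.5, 1.0]
```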
For the `ExitSelectionLayer` it would be useful to have a select/mux-type module.
Adding explicit IDs to partitions allows for more clarity when matching partitions to platforms.
Is it the name? Say two nodes in the network have the same function and the same number of edges in and out? Or, for example, when looking for a vertical merge opportunity: how do we know that we are in fact looking at a vertical copy of the partition, and not another partition that happens to have the same output node and input?
The MatMul operation is generated by some PyTorch ONNX conversions instead of the GEMM operation. Add this to the parser.
One way to further exploit the parallelism of multiple FPGAs is to split the featuremap computation across different FPGA partitions. This can be introduced in the `tiling` transform.

Linked to #38.
@AlexMontgomerie Should the names of the layers be indexed using the platform they point to or should it be represented by a pair_id?
Indexed by platform they point to:
Currently the optimiser can produce a network description, however there is no method of loading an existing network description. This could be useful for design checkpointing and so on.
For the time being I'm going to work on parsing the graph, specifically the subgraphs, which I think are ignored at the moment.
Placeholders for the following ops will be put in for the time being.