
Commit b19f771

Merge branch 'main' into vivado_codegen_namespace
2 parents f95b34f + 3b7e595 commit b19f771

File tree: 175 files changed, +8865 / −1484 lines


.pre-commit-config.yaml

Lines changed: 9 additions & 9 deletions
@@ -2,20 +2,26 @@ exclude: (^hls4ml\/templates\/(vivado|quartus)\/(ap_types|ac_types)\/|^test/pyte

repos:
  - repo: https://github.com/psf/black
-    rev: 24.10.0
+    rev: 25.1.0
    hooks:
      - id: black
        language_version: python3
        args: ['--line-length=125',
               '--skip-string-normalization']

+  - repo: https://github.com/tox-dev/pyproject-fmt
+    rev: v2.5.0
+    hooks:
+      - id: pyproject-fmt
+
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v5.0.0
    hooks:
      - id: check-added-large-files
      - id: check-case-conflict
      - id: check-merge-conflict
      - id: check-symlinks
+      - id: check-toml
      - id: check-yaml
      - id: debug-statements
      - id: end-of-file-fixer

@@ -24,22 +30,16 @@ repos:
      - id: trailing-whitespace

  - repo: https://github.com/PyCQA/isort
-    rev: 5.13.2
+    rev: 6.0.0
    hooks:
      - id: isort
-        args: ["--profile", "black", --line-length=125]

  - repo: https://github.com/asottile/pyupgrade
-    rev: v3.19.0
+    rev: v3.19.1
    hooks:
      - id: pyupgrade
        args: ["--py36-plus"]

-  - repo: https://github.com/asottile/setup-cfg-fmt
-    rev: v2.7.0
-    hooks:
-      - id: setup-cfg-fmt
-
  - repo: https://github.com/pycqa/flake8
    rev: 7.1.1
    hooks:

CITATION.cff

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ type: software
authors:
  - given-names: "FastML Team"
title: "hls4ml"
-version: "v0.8.1"
+version: "v1.0.0"
doi: 10.5281/zenodo.1201549
repository-code: "https://github.com/fastmachinelearning/hls4ml"
url: "https://fastmachinelearning.org/hls4ml"

MANIFEST.in

Lines changed: 4 additions & 2 deletions
@@ -1,7 +1,9 @@
-include LICENSE README.md CONTRIBUTING.md CITATION.cff pyproject.toml setup.py setup.cfg .clang-format
+include LICENSE README.md CONTRIBUTING.md CITATION.cff pyproject.toml .clang-format
graft example-models
graft test
graft contrib
recursive-include hls4ml/templates *
-global-exclude .git .gitmodules .gitlab-ci.yml
+recursive-include hls4ml *.py
+recursive-include hls4ml/contrib *
+global-exclude .git .gitmodules .gitlab-ci.yml *.pyc
include hls4ml/backends/vivado_accelerator/supported_boards.json

README.md

Lines changed: 11 additions & 5 deletions
@@ -15,7 +15,9 @@ If you have any questions, comments, or ideas regarding hls4ml or just want to s

# Documentation & Tutorial

-For more information visit the webpage: [https://fastmachinelearning.org/hls4ml/](https://fastmachinelearning.org/hls4ml/)
+For more information visit the webpage: [https://fastmachinelearning.org/hls4ml/](https://fastmachinelearning.org/hls4ml/).
+
+For introductory material on FPGAs, HLS and ML inference using hls4ml, check out the [video](https://www.youtube.com/watch?v=2y3GNY4tf7A&ab_channel=SystemsGroupatETHZ%C3%BCrich).

Detailed tutorials on how to use `hls4ml`'s various functionalities can be found [here](https://github.com/hls-fpga-machine-learning/hls4ml-tutorial).

@@ -49,8 +51,8 @@ hls_model = hls4ml.converters.keras_to_hls(config)
hls4ml.utils.fetch_example_list()
```

-### Building a project with Xilinx Vivado HLS (after downloading and installing from [here](https://www.xilinx.com/products/design-tools/vivado/integration/esl-design.html))
-Note: Vitis HLS is not yet supported. Vivado HLS versions between 2018.2 and 2020.1 are recommended.
+### Building a project
+We will build the project using Xilinx Vivado HLS, which can be downloaded and installed from [here](https://www.xilinx.com/products/design-tools/vivado/integration/esl-design.html). Alongside Vivado HLS, hls4ml also supports Vitis HLS, Intel HLS, Catapult HLS, and has some experimental support for Intel oneAPI. The target backend can be changed using the `backend` argument when building the model.

```Python
# Use Vivado HLS to synthesize the model

@@ -61,15 +63,19 @@ hls_model.build()
hls4ml.report.read_vivado_report('my-hls-test')
```

+# FAQ
+
+A list of frequently asked questions and common HLS synthesis issues can be found [here](https://fastmachinelearning.org/hls4ml/faq.html).
+
# Citation
If you use this software in a publication, please cite the software
```bibtex
@software{fastml_hls4ml,
  author = {{FastML Team}},
  title = {fastmachinelearning/hls4ml},
-  year = 2023,
+  year = 2024,
  publisher = {Zenodo},
-  version = {v0.8.1},
+  version = {v1.0.0},
  doi = {10.5281/zenodo.1201549},
  url = {https://github.com/fastmachinelearning/hls4ml}
}
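
The new README wording notes that the target backend is selected via the `backend` argument. A minimal sketch of what that looks like (not part of this commit; the `model` variable and output directory are placeholders, and the call assumes the current `convert_from_keras_model` signature):

```Python
import hls4ml

# Illustrative only: switch the synthesis backend via the `backend` argument.
config = hls4ml.utils.config_from_keras_model(model, granularity='model')

hls_model = hls4ml.converters.convert_from_keras_model(
    model, hls_config=config, backend='Vitis', output_dir='my-hls-test-vitis'
)
hls_model.build()  # runs the HLS flow of the selected backend
```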

docs/advanced/auto.rst

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
=============================
Automatic precision inference
=============================

The automatic precision inference (implemented in :py:class:`~hls4ml.model.optimizer.passes.infer_precision.InferPrecisionTypes`) attempts to infer the appropriate
widths for a given precision. It is initiated by setting a precision in the configuration as ``'auto'``. (Note, only layer-level precisions can be set to ``'auto'``,
not model-level.) Functions like :py:class:`~hls4ml.utils.config.config_from_keras_model`, :py:class:`~hls4ml.utils.config.config_from_onnx_model`,
and :py:class:`~hls4ml.utils.config.config_from_pytorch_model` automatically set most precisions to ``'auto'`` if the ``'name'`` granularity is used.

.. note::
    It is recommended to pass the backend to the ``config_from_*`` functions so that they can properly extract all the configurable precisions.

The approach taken by the precision inference is to set the accumulator (the internal variable used to accumulate values in the matrix multiplications) and other precisions
to never truncate, using only the bit widths of the inputs (not the values). This is quite conservative, especially in cases where post-training quantization is used, or
if the bit widths were set fairly loosely. The recommended action in that case is to edit the configuration and explicitly set some widths in it, potentially in an iterative process
after profiling the data. Another option is to pass a maximum precision using the ``max_precision`` parameter of the ``config_from_*`` functions. The automatic precision
inference will then never set a bit width larger than that of ``max_precision``, nor an integer part larger than the integer part of ``max_precision``.
(The bit width and integer parts of ``max_precision`` are treated separately.)

When manually setting bit widths, the accumulator can overflow, and the precision may need to be reduced. For the accumulator, it is usually a bad idea to explicitly
enable rounding or saturation modes since it dramatically increases the execution time. For other types (e.g. output types or weight types), however, rounding and saturation handling
can be enabled as needed.
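
A short sketch of the workflow described in the new page (the `model` variable and the layer name are placeholders; `backend` and `max_precision` are the parameters mentioned in the text, and the per-layer config path assumes the usual `'name'`-granularity layout):

```Python
import hls4ml

# 'name' granularity sets most per-layer precisions to 'auto' automatically.
config = hls4ml.utils.config_from_keras_model(
    model, granularity='name', backend='Vivado', max_precision='ap_fixed<24,8>'
)

# Individual precisions can also be set to 'auto' by hand (layer name is illustrative).
config['LayerName']['dense_1']['Precision']['accum'] = 'auto'

hls_model = hls4ml.converters.convert_from_keras_model(
    model, hls_config=config, backend='Vivado', output_dir='my-hls-test'
)
```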

docs/advanced/bramfactor.rst

Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
==================================
Loading weights from external BRAM
==================================

.. note::
    This feature is being evaluated for re-implementation. We welcome feedback from users on how to make the implementation more flexible.

``hls4ml`` can optionally store weights in BRAMs external to the design. This is supported in the Vivado/Vitis and Catapult backends. It is the responsibility of the user to ensure the weights are properly loaded during the operation of the design.

The feature works as a threshold, exposed through a ``BramFactor`` config parameter. Layers with more weights than the threshold will be exposed as a BRAM interface. Consider the following code:

.. code-block:: Python

    model = tf.keras.models.Sequential()
    model.add(Dense(10, activation="relu", input_shape=(12,), name="dense_1"))
    model.add(Dense(20, activation="relu", name="dense_2"))
    model.add(Dense(5, activation="softmax", name="dense_3"))
    model.compile(optimizer='adam', loss='mse')

    config = hls4ml.utils.config_from_keras_model(model)
    config["Model"]["Strategy"] = "Resource"
    config["Model"]["BramFactor"] = 100

    hls_model = hls4ml.converters.convert_from_keras_model(
        model, hls_config=config, output_dir=output_dir, io_type=io_type, backend=backend
    )

Having set ``BramFactor=100``, only layers with more than 100 weights will be exposed as external BRAM, in this case layers ``dense_1`` and ``dense_2``. ``BramFactor`` can currently only be set at the model level. The generated code will now have weights as part of the interface.

.. code-block:: C++

    void myproject(
        hls::stream<input_t> &dense_1_input,
        hls::stream<result_t> &layer7_out,
        model_default_t w2[120],
        model_default_t w4[200]
    ) {
        #pragma HLS INTERFACE axis port=dense_1_input,layer7_out
        #pragma HLS INTERFACE bram port=w2,w4
        ...

When integrating the design, users can use the exposed interface to implement a weight reloading scheme.

docs/advanced/hgq.rst

Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
===================================
High Granularity Quantization (HGQ)
===================================

.. image:: https://github.com/calad0i/HGQ/actions/workflows/sphinx-build.yml/badge.svg
   :target: https://calad0i.github.io/HGQ/
.. image:: https://badge.fury.io/py/hgq.svg
   :target: https://badge.fury.io/py/hgq
.. image:: https://img.shields.io/badge/arXiv-2405.00645-b31b1b.svg
   :target: https://arxiv.org/abs/2405.00645

`High Granularity Quantization (HGQ) <https://github.com/calad0i/HGQ/>`_ is a library that performs gradient-based automatic bitwidth optimization and quantization-aware training for neural networks to be deployed on FPGAs. By leveraging gradients, it allows for bitwidth optimization at arbitrary granularity, up to the per-weight and per-activation level.

.. image:: https://calad0i.github.io/HGQ/_images/overview.svg
   :alt: Overview of HGQ
   :align: center

Conversion of models made with the HGQ library is fully supported. The HGQ models are first converted to the proxy model format, which can then be parsed by hls4ml bit-accurately. Below is an example of how to create a model with HGQ and convert it to an hls4ml model.

.. code-block:: Python

    import keras
    from HGQ.layers import HDense, HDenseBatchNorm, HQuantize
    from HGQ import ResetMinMax, FreeBOPs

    model = keras.models.Sequential([
        HQuantize(beta=1.e-5),
        HDenseBatchNorm(32, beta=1.e-5, activation='relu'),
        HDenseBatchNorm(32, beta=1.e-5, activation='relu'),
        HDense(10, beta=1.e-5),
    ])

    opt = keras.optimizers.Adam(learning_rate=0.001)
    loss = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    model.compile(optimizer=opt, loss=loss, metrics=['accuracy'])
    callbacks = [ResetMinMax(), FreeBOPs()]

    model.fit(..., callbacks=callbacks)

    from HGQ import trace_minmax, to_proxy_model
    from hls4ml.converters import convert_from_keras_model

    trace_minmax(model, x_train, cover_factor=1.0)
    proxy = to_proxy_model(model, aggressive=True)

    model_hls = convert_from_keras_model(proxy, backend='vivado', output_dir=..., part=...)

An interactive example of HGQ can be found in the `kaggle notebook <https://www.kaggle.com/code/calad0i/small-jet-tagger-with-hgq-1>`_. Full documentation can be found at `calad0i.github.io/HGQ <https://calad0i.github.io/HGQ/>`_.

docs/advanced/model_optimization.rst

Lines changed: 2 additions & 2 deletions
@@ -124,8 +124,8 @@ Finally, optimizing Vivado DSPs is possible, given a hls4ml config:
    acc_optimized = accuracy_score(np.argmax(y_test, axis=1), np.argmax(y_optimized, axis=1))
    print(f'Optimized Keras accuracy: {acc_optimized}')

-There are two more Vivado "optimizers" - VivadoFFEstimator, aimed at reducing register utilisation and VivadoMultiObjectiveEstimator, aimed at optimising BRAM and DSP utilisation.
-Note, to ensure DSPs are optimized, "unrolled" Dense multiplication must be used before synthesing HLS, by modifying the config:
+There are two more Vivado "optimizers" - VivadoFFEstimator, aimed at reducing register utilization and VivadoMultiObjectiveEstimator, aimed at optimizing BRAM and DSP utilization.
+Note, to ensure DSPs are optimized, "unrolled" Dense multiplication must be used before synthesizing HLS, by modifying the config:

.. code-block:: Python
File renamed without changes.

docs/command.rst renamed to docs/api/command.rst

Lines changed: 1 addition & 1 deletion
@@ -50,7 +50,7 @@ hls4ml config

hls4ml config [-h] [-m MODEL] [-w WEIGHTS] [-o OUTPUT]

-This creates a conversion configuration file. Visit Configuration section of the :doc:`Setup <setup>` page for more details on how to write a configuration file.
+This creates a conversion configuration file. Visit the Configuration section of the :doc:`Setup <../intro/setup>` page for more details on how to write a configuration file.

**Arguments**

docs/api/concepts.rst

Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
========
Concepts
========

How it Works
----------------------

.. image:: ../img/nn_map_paper_fig_2.png
   :width: 70%
   :align: center

Consider a multilayer neural network. At each neuron in a layer :math:`m` (containing :math:`N_m` neurons), we calculate an output value (part of the output vector :math:`\mathbf{x}_m` of said layer) using the sum of the output values of the previous layer multiplied by independent weights for each of these values, plus a bias value. An activation function is applied to the result to get the final output value for the neuron. Representing the weights as an :math:`N_m` by :math:`N_{m-1}` matrix :math:`W_{m,m-1}`, the bias values as :math:`\mathbf{b}_m`, and the activation function as :math:`g_m`, we can express this compactly as:

.. math::

   \mathbf{x}_m = g_m (W_{m,m-1} \mathbf{x}_{m-1} + \mathbf{b}_m)

With hls4ml, each layer of output values is calculated independently in sequence, using pipelining to speed up the process by accepting new inputs after an initiation interval.
The activations, if nontrivial, are precomputed.

To ensure optimal performance, the user can control aspects of their model, principally:

* **Size/Compression** - Though not explicitly part of the ``hls4ml`` package, this is an important optimization to efficiently use the FPGA resources
* **Precision** - Define the :doc:`precision <../advanced/profiling>` of the calculations in your model
* **Dataflow/Resource Reuse** - Control parallel or streaming model implementations with varying levels of pipelining
* **Quantization Aware Training** - Achieve the best performance at low precision with tools like QKeras, and benefit automatically during inference with ``hls4ml`` parsing of QKeras models

.. image:: ../img/reuse_factor_paper_fig_8.png
   :width: 70%
   :align: center

Often, these decisions will be hardware dependent to maximize performance.
Of note is that simplifying the input network must be done before using ``hls4ml`` to generate HLS code, for optimal compression to provide a sizable speedup.
Also important to note is the use of fixed-point arithmetic in ``hls4ml``.
This improves processing speed relative to floating-point implementations.
The ``hls4ml`` package also offers the functionality of configuring the binning and output bit width of the precomputed activation functions as necessary. With respect to parallelization and resource reuse, ``hls4ml`` offers a "reuse factor" parameter that determines the number of times each multiplier is used in order to compute a layer of neurons' values. Therefore, a reuse factor of one would split the computation so that each multiplier has to perform only one multiplication in the computation of the output values of a layer, as shown above. Conversely, a reuse factor of four, in this case, uses a single multiplier four times sequentially. A low reuse factor achieves the lowest latency and highest throughput but uses the most resources, while a high reuse factor saves resources at the expense of longer latency and lower throughput.

Frontends and Backends
----------------------

``hls4ml`` has a concept of a **frontend** that parses the input NN into an internal model graph, and a **backend** that controls
what type of output is produced from the graph. Frontends and backends can be independently chosen. Examples of frontends are the
parsers for Keras or ONNX, and examples of backends are Vivado HLS, Intel HLS, and Vitis HLS. See :ref:`Status and Features` for the
currently supported frontends and backends or the dedicated sections for each frontend/backend.

I/O Types
---------

``hls4ml`` supports multiple styles for handling data transfer to/from the network and between layers, known as the ``io_type``.

io_parallel
^^^^^^^^^^^
In this processing style, data is passed in parallel between the layers. Conceptually this corresponds to a C/C++ array where all elements can be accessed at any time. This style allows for maximum parallelism and is well suited for MLP networks and small CNNs which aim for the lowest latency. Due to the impact of parallel processing on resource utilization on FPGAs, the synthesis may fail for larger networks.

io_stream
^^^^^^^^^
As opposed to the parallel processing style, in ``io_stream`` mode data is passed one "pixel" at a time. Each pixel is an array of channels, which are always sent in parallel. This method for sending data between layers is recommended for larger CNN and RNN networks. For one-dimensional ``Dense`` layers, all the inputs are streamed in parallel as a single array.

With the ``io_stream`` IO type, each layer is connected with the subsequent layer through first-in first-out (FIFO) buffers.
The implementation of the FIFO buffers contributes to the overall resource utilization of the design, impacting in particular the BRAM or LUT utilization.
Because neural networks can generally have complex architectures, it is hard to know a priori the correct depth of each FIFO buffer.
By default ``hls4ml`` chooses the most conservative possible depth for each FIFO buffer, which can result in an unnecessary overutilization of resources.

In order to reduce the impact on the resources used for FIFO buffer implementation, we have a FIFO depth optimization flow. This is described
in the :ref:`FIFO Buffer Depth Optimization` section.

Strategy
---------

**Strategy** in ``hls4ml`` refers to the implementation of the core matrix-vector multiplication routine, which can be latency-oriented, resource-saving oriented, or specialized. Different strategies will have an impact on the overall latency and resource consumption of each layer, and users are advised to choose based on their design goals. The availability of a particular strategy for a layer varies across backends; see the :doc:`Attributes <../ir/attributes>` section for a complete list of available strategies per-layer and per-backend.
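
To make the reuse factor, strategy, and ``io_type`` concepts above concrete, here is an illustrative configuration sketch (not part of this commit; the `model` variable and output directory are placeholders):

```Python
import hls4ml

config = hls4ml.utils.config_from_keras_model(model, granularity='model')
config['Model']['ReuseFactor'] = 4        # each multiplier reused 4 times: fewer DSPs, higher latency
config['Model']['Strategy'] = 'Resource'  # resource-saving matrix-vector multiplication routine

hls_model = hls4ml.converters.convert_from_keras_model(
    model, hls_config=config, io_type='io_stream', backend='Vivado', output_dir='my-hls-test'
)
```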
