Update the dev branch #91

Merged · 10 commits · Dec 4, 2024

1 change: 1 addition & 0 deletions README.md
@@ -24,6 +24,7 @@ Example | Description
[axi_target](./axi_target)|Example of an AXI4-Target top-level interface.
[Canny_RISCV](./Canny_RISCV)|Integrating a SmartHLS module created using the IP Flow into the RISC-V subsystem.
[ECC_demo](./ECC_demo)|Example of Error Correction Code feature.
[auto_instrument](./auto_instrument/)|Example of Automatic On-Chip Instrumentation feature.

## Simple Examples
Example | Description
12 changes: 5 additions & 7 deletions Training1/readme.md
@@ -1015,7 +1015,7 @@ run in the pipeline at each stage. The leftmost column indicates the
loop iteration for the instructions in the row (starting from Iteration
0). For function pipelines, Iteration 0 corresponds to the first input.

If you hold you mouse over an instruction you will see more details
If you hold your mouse over an instruction you will see more details
about the operation type.

<p align="center"><img src=".//media/steady_state_alpha_blend_pipeline_viewer.png" /></br>Figure 15: SmartHLS Schedule Viewer: Pipeline Viewer</p></br>
@@ -1577,7 +1577,7 @@ SmartHLS.
Pipelining is a common HLS optimization used to increase hardware
throughput and to better utilize FPGA hardware resources. We also
covered the concept of loop pipelining in the SmartHLS Sobel Filter
Tutorial. In Figure 18a) shows a loop to be scheduled with 3
Tutorial. Figure 18a) shows a loop to be scheduled with 3
single-cycle operations: Load, Comp, Store. We show a comparison of the
cycle-by-cycle operations when hardware operations in a loop are
implemented b) sequentially (default) or c) pipelined (with SmartHLS
@@ -1606,7 +1606,7 @@ pragma or the function pipeline pragma:
Loop pipelining only applies to a specific loop in a C++ function.
Meanwhile, function pipelining is applied to an entire C++ function and
SmartHLS will automatically unrolls all loops in that function.
SmartHLS will automatically unroll all loops in that function.
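
To make the distinction concrete, here is a minimal sketch of both pragmas applied to the same loop (function and variable names are illustrative, and the pragma spellings follow the SmartHLS convention; the pragma code block itself is folded out of this diff):

```c++
// Loop pipelining: only the annotated loop is pipelined.
void scale_loop(int *buf) {
#pragma HLS loop pipeline
    for (int i = 0; i < 8; i++)
        buf[i] *= 2; // a new iteration can start each cycle
}

// Function pipelining: the whole function becomes a pipeline,
// and SmartHLS automatically unrolls the loop in its body.
void scale_func(int *buf) {
#pragma HLS function pipeline
    for (int i = 0; i < 8; i++)
        buf[i] *= 2;
}
```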

## SmartHLS Pipelining Hazards: Why Initiation Interval Cannot Always Be 1

@@ -2710,14 +2710,12 @@ the user specifies an incorrect value in a SmartHLS pragma. For example,
specifying an incorrect depth on a memory interface such as the
following on line 29:
```c
#pragma HLS interface argument(input_buffer) type(memory)
num_elements(SIZE)
#pragma HLS interface argument(input_buffer) type(memory) num_elements(SIZE)
```
We can try changing the correct SIZE array depth to a wrong value
like 10:
```c
#pragma HLS interface argument(input_buffer) type(memory)
num_elements(10)
#pragma HLS interface argument(input_buffer) type(memory) num_elements(10)
```

Now we rerun SmartHLS to generate the hardware
59 changes: 29 additions & 30 deletions Training2/readme.md
@@ -783,7 +783,7 @@ int main() {

## Verification: Co-simulation of Multi-threaded SmartHLS Code

As mentioned before the `producer_consumer` project cannot be simulated
As mentioned before, the `producer_consumer` project cannot be simulated
with co-simulation. This is because the `producer_consumer` project has
threads that run forever and do not finish before the top-level function
returns. SmartHLS co-simulation supports a single call to the top-level
@@ -1127,11 +1127,11 @@ is defined by the number of rows, columns, and the depth.
Convolution layers are good for extracting geometric features from an
input tensor. Convolution layers work in the same way as an image
processing filter (such as the Sobel filter) where a square filter
(called a **kernel**) is slid across an input image. The **size** of
filter is equal to the side length of the square filter, and the size of
the step when sliding the filter is called the **stride**. The values of
the input tensor under the kernel (called the **window**) and the values
of the kernel are multiplied and summed at each step, which is also
(called a **kernel**) is slid across an input image. The **size** of a
filter is equal to its side length, and the size of the step when sliding
the filter is called the **stride**. The values of the input tensor
under the kernel (called the **window**) and the values of the
kernel are multiplied and summed at each step, which is also
called a convolution. Figure 13 shows an example of a convolution layer
processing an input tensor with a depth of 1.
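
To make the sliding-window arithmetic concrete, here is a minimal depth-1 convolution sketch in plain C++ (the sizes and names are illustrative, not taken from the tutorial's code):

```c++
// Valid (no-padding) convolution: OUT = (IN - K) / STRIDE + 1.
const int IN = 5, K = 3, STRIDE = 1;
const int OUT = (IN - K) / STRIDE + 1;

void conv2d(const int in[IN][IN], const int kernel[K][K],
            int out[OUT][OUT]) {
    for (int r = 0; r < OUT; r++) {          // slide the window down the rows
        for (int c = 0; c < OUT; c++) {      // ...and across the columns
            int acc = 0;
            for (int i = 0; i < K; i++)      // multiply and sum the values
                for (int j = 0; j < K; j++)  // under the kernel (the window)
                    acc += in[r * STRIDE + i][c * STRIDE + j] * kernel[i][j];
            out[r][c] = acc;                 // one output value per step
        }
    }
}
```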

@@ -1580,16 +1580,16 @@ we show the input tensor values and convolution filters involved in the
computation of the set of colored output tensor values (see Loop 3
arrow).

Loop 1 and Loop 2 the code traverses along the row and column dimensions
For Loop 1 and Loop 2, the code traverses along the row and column dimensions
of the output tensor. Loop 3 traverses along the depth dimension of the
output tensor, each iteration computes a `PARALLEL_KERNELS` number of
output tensor, and each iteration computes a total of `PARALLEL_KERNELS`
outputs. The `accumulated_value` array will hold the partial
dot-products. Loop 4 traverses along the row and column dimensions of
the input tensor and convolution filter kernels. Then Loop 5 walks
through each of the `PARALLEL_KERNELS` number of selected convolution
the input tensor and convolution filter kernels. Then, Loop 5 walks
through each of the `PARALLEL_KERNELS` selected convolution
filters and Loop 6 traverses along the depth dimension of the input
tensor. Loop 7 and Loop 8 add up the partial sums together with biases
to produce `PARALLEL_KERNEL` number of outputs.
to produce `PARALLEL_KERNELS` outputs.

```C
const static unsigned PARALLEL_KERNELS = NUM_MACC / INPUT_DEPTH;
```

@@ -2203,7 +2203,7 @@ instructions that always run together with a single entry point at the
beginning and a single exit point at the end. A basic block in LLVM IR
always has a label at the beginning and a branching instruction at the
end (br, ret, etc.). An example of LLVM IR is shown below, where the
`body.0` basic block performs an addition (add) and subtraction (sub) and
`body.0` basic block performs an addition (add) and subtraction (sub), and
then branches unconditionally (br) to another basic block labeled
`body.1`. Control flow occurs between basic blocks.
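
The IR listing itself is folded out of this diff; a hand-written sketch of the kind of block being described might look like this (names and values are illustrative, not SmartHLS output):

```llvm
body.0:                    ; label marks the start of the basic block
  %t1 = add i32 %x, %y     ; addition
  %t2 = sub i32 %t1, %z    ; subtraction
  br label %body.1         ; unconditional branch ends the block

body.1:                    ; control flow arrives here
  ret i32 %t2
```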

@@ -2231,7 +2231,7 @@ button (![](.//media/image28.png)) to build the design and generate the
schedule.

We can ignore the `printWarningMessageForGlobalArrayReset` warning message
for global variable a in this example as described in the producer
for global variable `a` in this example as described in the producer
consumer example in the [section 'Producer Consumer Example'](#producer-consumer-example).

The first example we will look at is the `no_dependency` example on line
@@ -2240,7 +2240,7 @@

<p align="center"><img src=".//media/image19.png" /></p>

```
```c++
8 void no_dependency() {
9 #pragma HLS function noinline
10 e = b + c;
```

@@ -2252,10 +2252,10 @@
<p align="center">Figure 28: Source code and data dependency graph for no_dependency
function.</p>

In this example, values are loaded from b, c, and d and additions happen
before storing to *e*, *f*, and *g*. None of the adds use results from
In this example, values are loaded from `b`, `c`, and `d`, and additions happen
before storing to `e`, `f`, and `g`. None of the adds use results from
the previous adds and thus all three adds can happen in parallel. The
*noinline* pragma is used to prevent SmartHLS from automatically
`noinline` pragma is used to prevent SmartHLS from automatically
inlining this small function and making it harder for us to understand
the schedule. Inlining is when the instructions in the called function
get copied into the caller, to remove the overhead of the function call
@@ -2290,17 +2290,17 @@ the store instruction highlighted in yellow depends on the result of the
add instruction as we expect.

We have declared all the variables used in this function as
**volatile**. The volatile C/C++ keyword specifies that the variable can
**volatile**. The `volatile` C/C++ keyword specifies that the variable can
be updated by something other than the program itself, making sure that
any operation with these variables does not get optimized away by the
compiler as every operation matters. An example of where the compiler
handles this incorrectly is seen in the [section 'Producer Consumer Example'](#producer-consumer-example), where we had to
declare a synchronization signal between two threaded functions as
volatile. Using volatile is required for toy examples to make sure each
`volatile`. Using `volatile` is required for toy examples to make sure each
operation we perform with these variables will be generated in hardware
and viewable in the Schedule Viewer.

```
```c++
4 volatile int a[5] = {0};
5 volatile int b = 0, c = 0, d = 0;
6 volatile int e, f, g;
```

@@ -2315,7 +2315,7 @@ code and SmartHLS cannot schedule all instructions in the first cycle.

<p align="center"><img src=".//media/image68.png" /></p>

```
```c++
15 void data_dependency() {
16 #pragma HLS function noinline
17 e = b + c;
```

@@ -2337,8 +2337,8 @@ second add is also used in the third add. These are examples of data
dependencies as later adds use the data result of previous adds. Because
we must wait for the result `e` to be produced before we can compute `f`,
and then the result `f` must be produced before we can compute `g`, not all
instructions can be scheduled immediately. They must wait for their
dependent instructions to finish executing before they can start, or
instructions can be scheduled immediately. They must wait for the instructions
they depend on to finish executing before they can start, or
they would produce the wrong result.
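
The remaining lines of `data_dependency` are folded out of this diff, but the chain being described would look something like this (a hypothetical sketch, not the tutorial's exact code):

```c++
e = b + c;  // first add
f = e + d;  // reads e: must wait for the first add
g = f + b;  // reads f: must wait for the second add
```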

<p align="center"><img src=".//media/image70.png" /></br>
@@ -2375,7 +2375,7 @@ memories.

<p align="center"><img src=".//media/image72.png" /></p>

```
```c++
22 void memory_dependency() {
23 #pragma HLS function noinline
24 volatile int i = 0;
```

@@ -2419,7 +2419,7 @@ resource cannot be scheduled in parallel due to a lack of resources.
`resource_contention` function on line 30 of
`instruction_level_parallelism.cpp`.

```
```c++
30 void resource_contention() {
31 #pragma HLS function noinline
32 e = a[0];
```

@@ -2452,7 +2452,7 @@ when generating the schedule for a design.
Next, we will see an example of how loops prevent operations from being
scheduled in parallel.

```
```c++
37 void no_loop_unroll() {
38 #pragma HLS function noinline
39 int h = 0;
```

@@ -2481,10 +2481,9 @@ has no unrolling on the loop and `loop_unroll` unrolls the loop
completely. This affects the resulting hardware by removing the control
signals needed to facilitate the loop and combining multiple loop bodies
into the same basic block, allowing more instructions to be scheduled in
parallel. The trade-off here is an unrolled loop does not reuse hardware
resources and can potentially use a lot of resources. However, the
unrolled loop would finish earlier depending on how inherently parallel
the loop body is.
parallel. The trade-off here is that an unrolled loop does not reuse hardware
resources and can potentially use a lot of resources; however, it will
finish earlier depending on how inherently parallel the loop body is.
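
Since the `loop_unroll` source is folded out of this diff, here is a hypothetical sketch of that variant, reusing the `a` and `g` globals declared earlier (the unroll pragma spelling follows the SmartHLS convention):

```c++
// Fully unrolling replicates the loop body, so the five loads and
// adds land in one basic block and more of them can be scheduled
// in parallel (subject to memory port limits), trading extra
// hardware for fewer cycles.
void loop_unroll() {
#pragma HLS function noinline
    int h = 0;
#pragma HLS loop unroll
    for (int i = 0; i < 5; i++)
        h += a[i];
    g = h;
}
```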

![](.//media/image3.png)To see the effects of this, open the Schedule
Viewer and first click on the `no_loop_unroll` function shown in Figure
2 changes: 1 addition & 1 deletion Training3/readme.md
@@ -2002,7 +2002,7 @@ format to write and read from the addresses that line up with addresses
in the AXI target interface in the SmartHLS core. For burst mode, the
processor will also write to and read from addresses corresponding to
the DDR memory. Note, the pointers are cast as volatile to prevent the
SoftConsole compiler from optimization away these reads and writes. The
SoftConsole compiler from optimizing away these reads and writes. The
Mi-V then asserts the run signal and waits until the accelerator
de-asserts it, signaling that the computation is done. The Mi-V then reads from
the memory to issue read requests and get the results from the
6 changes: 3 additions & 3 deletions Training4/readme.md
@@ -385,7 +385,7 @@ array.

```c
24 // The core logic of this example
25 void vector_add_sw(int a, int b, int result) {
25 void vector_add_sw(int* a, int* b, int* result) {
26 for (int i = 0; i < SIZE; i++) {
27 result[i] = a[i] + b[i];
28 }
```

@@ -398,7 +398,7 @@
C++ function as shown in Figure 6‑12.

```c
70 void vector_add_axi_target_memcpy(int a, int b, int result) {
70 void vector_add_axi_target_memcpy(int* a, int* b, int* result) {
71 #pragma HLS function top
72 #pragma HLS interface control type(axi_target)
73 #pragma HLS interface argument(a) type(axi_target) num_elements(SIZE)
```

@@ -500,7 +500,7 @@ corresponding to the C++ top-level function as shown below in Figure
clock and a single AXI4 Target port. Due to the large number of AXI4
ports in the RTL, SmartHLS uses a wildcard “`axi4target_*`” to
simplify the table. The “Control AXI4 Target” indicates that
start/finish control as done using the AXI target interface. Each of the
start/finish control is done using the AXI target interface. Each of the
function’s three arguments also uses the AXI target interface. The
address map of the AXI target port is given later in the report.
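
As a rough sketch of the start/finish handshake from the processor's side (the base address and register offset here are hypothetical; the real values come from the address map in the report):

```c++
#include <cstdint>

// Hypothetical addresses -- substitute the values from the
// SmartHLS-generated address map.
constexpr std::uintptr_t ACCEL_BASE = 0x60000000;
constexpr std::uintptr_t RUN_OFFSET = 0x0;

void start_and_wait() {
    // volatile keeps the compiler from optimizing away these
    // memory-mapped reads and writes.
    volatile std::uint32_t *run =
        reinterpret_cast<volatile std::uint32_t *>(ACCEL_BASE + RUN_OFFSET);
    *run = 1;           // assert the run signal
    while (*run != 0) { // wait until the accelerator de-asserts it
    }
}
```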

3 changes: 3 additions & 0 deletions auto_instrument/Makefile
@@ -0,0 +1,3 @@
SRCS=main.cpp
LOCAL_CONFIG = -legup-config=config.tcl
HLS_INSTRUMENT_ENABLE=1