
Commit 0b3275f

Merge pull request #91 from MicrochipTech/update_dev_shuran
Update the dev branch
2 parents ccff3ef + d963885 commit 0b3275f

28 files changed: +770 −47 lines changed

README.md

Lines changed: 1 addition & 0 deletions
@@ -24,6 +24,7 @@ Example | Description
 [axi_target](./axi_target)|Example of an AXI4-Target top-level interface.
 [Canny_RISCV](./Canny_RISCV)|Integrating a SmartHLS module created using the IP Flow into the RISC-V subsystem.
 [ECC_demo](./ECC_demo)|Example of Error Correction Code feature.
+[auto_instrument](./auto_instrument/)|Example of Automatic On-Chip Instrumentation feature.
 
 ## Simple Examples
 Example | Description

Training1/readme.md

Lines changed: 5 additions & 7 deletions
@@ -1015,7 +1015,7 @@ run in the pipeline at each stage. The leftmost column indicates the
 loop iteration for the instructions in the row (starting from Iteration
 0). For function pipelines, Iteration 0 corresponds to the first input.
 
-If you hold you mouse over an instruction you will see more details
+If you hold your mouse over an instruction you will see more details
 about the operation type.
 
 <p align="center"><img src=".//media/steady_state_alpha_blend_pipeline_viewer.png" /></br>Figure 15: SmartHLS Schedule Viewer: Pipeline Viewer</p></br>
@@ -1577,7 +1577,7 @@ SmartHLS.
 Pipelining is a common HLS optimization used to increase hardware
 throughput and to better utilize FPGA hardware resources. We also
 covered the concept of loop pipelining in the SmartHLS Sobel Filter
-Tutorial. In Figure 18a) shows a loop to be scheduled with 3
+Tutorial. Figure 18a) shows a loop to be scheduled with 3
 single-cycle operations: Load, Comp, Store. We show a comparison of the
 cycle-by-cycle operations when hardware operations in a loop are
 implemented b) sequentially (default) or c) pipelined (with SmartHLS
@@ -1606,7 +1606,7 @@ pragma or the function pipeline pragma:
 ```
 Loop pipelining only applies to a specific loop in a C++ function.
 Meanwhile, function pipelining is applied to an entire C++ function and
-SmartHLS will automatically unrolls all loops in that function.
+SmartHLS will automatically unroll all loops in that function.
 
 ## SmartHLS Pipelining Hazards: Why Initiation Interval Cannot Always Be 1
 
@@ -2710,14 +2710,12 @@ the user specifies an incorrect value in a SmartHLS pragma. For example,
 specifying an incorrect depth on a memory interface such as the
 following on line 29:
 ```c
-#pragma HLS interface argument(input_buffer) type(memory)
-num_elements(SIZE)
+#pragma HLS interface argument(input_buffer) type(memory) num_elements(SIZE)
 ```
 For example, we can try changing the correct SIZE array depth to a wrong
 value like 10:
 ```c
-#pragma HLS interface argument(input_buffer) type(memory)
-num_elements(10)
+#pragma HLS interface argument(input_buffer) type(memory) num_elements(10)
 ```
 
 Now we rerun SmartHLS to generate the hardware
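
For quick reference, the two pipelining pragmas discussed in this tutorial apply at different scopes. A minimal sketch (the loop bodies are hypothetical; the pragma spellings follow the SmartHLS loop and function pipeline pragmas named above):

```c
// Loop pipelining: applies only to the specific loop that follows the pragma.
void scale_loop(int *buf) {
    #pragma HLS loop pipeline
    for (int i = 0; i < 100; i++) {
        buf[i] = buf[i] * 3; // one Load, one Comp, one Store per iteration
    }
}

// Function pipelining: applies to the whole function, and SmartHLS
// automatically unrolls any loops inside it.
void scale_func(int buf[4]) {
    #pragma HLS function pipeline
    for (int i = 0; i < 4; i++)
        buf[i] = buf[i] * 3; // fully unrolled by SmartHLS
}
```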

Training2/readme.md

Lines changed: 29 additions & 30 deletions
@@ -783,7 +783,7 @@ int main() {
 
 ## Verification: Co-simulation of Multi-threaded SmartHLS Code
 
-As mentioned before the `producer_consumer` project cannot be simulated
+As mentioned before, the `producer_consumer` project cannot be simulated
 with co-simulation. This is because the `producer_consumer` project has
 threads that run forever and do not finish before the top-level function
 returns. SmartHLS co-simulation supports a single call to the top-level
@@ -1127,11 +1127,11 @@ is defined by the number of rows, columns, and the depth.
 Convolution layers are good for extracting geometric features from an
 input tensor. Convolution layers work in the same way as an image
 processing filter (such as the Sobel filter) where a square filter
-(called a **kernel**) is slid across an input image. The **size** of
-filter is equal to the side length of the square filter, and the size of
-the step when sliding the filter is called the **stride**. The values of
-the input tensor under the kernel (called the **window**) and the values
-of the kernel are multiplied and summed at each step, which is also
+(called a **kernel**) is slid across an input image. The **size** of a
+filter is equal to its side length, and the size of the step when sliding
+the filter is called the **stride**. The values of the input tensor
+under the kernel (called the **window**) and the values of the
+kernel are multiplied and summed at each step, which is also
 called a convolution. Figure 13 shows an example of a convolution layer
 processing an input tensor with a depth of 1.
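
To make the kernel, stride, and window terms concrete, here is a minimal single-channel convolution sketch; the names, bounds, and types are illustrative and not taken from the tutorial's code:

```c
#define IN_SIZE     8
#define KERNEL_SIZE 3  /* the filter's "size" (its side length) */
#define STRIDE      1  /* step taken when sliding the filter    */
#define OUT_SIZE    ((IN_SIZE - KERNEL_SIZE) / STRIDE + 1)

void conv2d(const int in[IN_SIZE][IN_SIZE],
            const int kernel[KERNEL_SIZE][KERNEL_SIZE],
            int out[OUT_SIZE][OUT_SIZE]) {
    for (int r = 0; r < OUT_SIZE; r++) {
        for (int c = 0; c < OUT_SIZE; c++) {
            int sum = 0; /* multiply-and-sum of the window and the kernel */
            for (int i = 0; i < KERNEL_SIZE; i++)
                for (int j = 0; j < KERNEL_SIZE; j++)
                    sum += in[r * STRIDE + i][c * STRIDE + j] * kernel[i][j];
            out[r][c] = sum; /* one output value per window position */
        }
    }
}
```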

@@ -1580,16 +1580,16 @@ we show the input tensor values and convolution filters involved in the
 computation of the set of colored output tensor values (see Loop 3
 arrow).
 
-Loop 1 and Loop 2 the code traverses along the row and column dimensions
+For Loop 1 and Loop 2, the code traverses along the row and column dimensions
 of the output tensor. Loop 3 traverses along the depth dimension of the
-output tensor, each iteration computes a `PARALLEL_KERNELS` number of
+output tensor, and each iteration computes a total of `PARALLEL_KERNELS`
 outputs. The `accumulated_value` array will hold the partial
 dot-products. Loop 4 traverses along the row and column dimensions of
-the input tensor and convolution filter kernels. Then Loop 5 walks
-through each of the `PARALLEL_KERNELS` number of selected convolution
+the input tensor and convolution filter kernels. Then, Loop 5 walks
+through each of the `PARALLEL_KERNELS` selected convolution
 filters and Loop 6 traverses along the depth dimension of the input
 tensor. Loop 7 and Loop 8 add up the partial sums together with biases
-to produce `PARALLEL_KERNEL` number of outputs.
+to produce `PARALLEL_KERNELS` outputs.
 
 ```C
 const static unsigned PARALLEL_KERNELS = NUM_MACC / INPUT_DEPTH;
@@ -2203,7 +2203,7 @@ instructions that always run together with a single entry point at the
 beginning and a single exit point at the end. A basic block in LLVM IR
 always has a label at the beginning and a branching instruction at the
 end (br, ret, etc.). An example of LLVM IR is shown below, where the
-`body.0` basic block performs an addition (add) and subtraction (sub) and
+`body.0` basic block performs an addition (add) and subtraction (sub), and
 then branches unconditionally (br) to another basic block labeled
 `body.1`. Control flow occurs between basic blocks.
 
@@ -2231,7 +2231,7 @@ button (![](.//media/image28.png)) to build the design and generate the
 schedule.
 
 We can ignore the `printWarningMessageForGlobalArrayReset` warning message
-for global variable a in this example as described in the producer
+for global variable `a` in this example as described in the producer
 consumer example in the [section 'Producer Consumer Example'](#producer-consumer-example).
 
 The first example we will look at is the `no_dependency` example on line
@@ -2240,7 +2240,7 @@ The first example we will look at is the `no_dependency` example on line
 
 <p align="center"><img src=".//media/image19.png" /></p>
 
-```
+```c++
 8 void no_dependency() {
 9 #pragma HLS function noinline
 10 e = b + c;
@@ -2252,10 +2252,10 @@ The first example we will look at is the `no_dependency` example on line
 <p align="center">Figure 28: Source code and data dependency graph for no_dependency
 function.</p>
 
-In this example, values are loaded from b, c, and d and additions happen
-before storing to *e*, *f*, and *g*. None of the adds use results from
+In this example, values are loaded from `b`, `c`, and `d`, and additions happen
+before storing to `e`, `f`, and `g`. None of the adds use results from
 the previous adds and thus all three adds can happen in parallel. The
-*noinline* pragma is used to prevent SmartHLS from automatically
+`noinline` pragma is used to prevent SmartHLS from automatically
 inlining this small function and making it harder for us to understand
 the schedule. Inlining is when the instructions in the called function
 get copied into the caller, to remove the overhead of the function call
@@ -2290,17 +2290,17 @@ the store instruction highlighted in yellow depends on the result of the
 add instruction as we expect.
 
 We have declared all the variables used in this function as
-**volatile**. The volatile C/C++ keyword specifies that the variable can
+**volatile**. The `volatile` C/C++ keyword specifies that the variable can
 be updated by something other than the program itself, making sure that
 any operations with these variables do not get optimized away by the
 compiler as every operation matters. An example of where the compiler
 handles this incorrectly is seen in the [section 'Producer Consumer Example'](#producer-consumer-example), where we had to
 declare a synchronization signal between two threaded functions as
-volatile. Using volatile is required for toy examples to make sure each
+`volatile`. Using `volatile` is required for toy examples to make sure each
 operation we perform with these variables will be generated in hardware
 and viewable in the Schedule Viewer.
 
-```
+```c++
 4 volatile int a[5] = {0};
 5 volatile int b = 0, c = 0, d = 0;
 6 volatile int e, f, g;
@@ -2315,7 +2315,7 @@ code and SmartHLS cannot schedule all instructions in the first cycle.
 
 <p align="center"><img src=".//media/image68.png" /></p>
 
-```
+```c++
 15 void data_dependency() {
 16 #pragma HLS function noinline
 17 e = b + c;
@@ -2337,8 +2337,8 @@ second add is also used in the third add. These are examples of data
 dependencies as later adds use the data result of previous adds. Because
 we must wait for the result `e` to be produced before we can compute `f`,
 and then the result `f` must be produced before we can compute `g`, not all
-instructions can be scheduled immediately. They must wait for their
-dependent instructions to finish executing before they can start, or
+instructions can be scheduled immediately. They must wait for the instructions
+they depend on to finish executing before they can start, or
 they would produce the wrong result.
 
 <p align="center"><img src=".//media/image70.png" /></br>
@@ -2375,7 +2375,7 @@ memories.
 
 <p align="center"><img src=".//media/image72.png" /></p>
 
-```
+```c++
 22 void memory_dependency() {
 23 #pragma HLS function noinline
 24 volatile int i = 0;
@@ -2419,7 +2419,7 @@ resource cannot be scheduled in parallel due to a lack of resources.
 `resource_contention` function on line 30 of
 `instruction_level_parallelism.cpp`.
 
-```
+```c++
 30 void resource_contention() {
 31 #pragma HLS function noinline
 32 e = a[0];
@@ -2452,7 +2452,7 @@ when generating the schedule for a design.
 Next, we will see an example of how loops prevent operations from being
 scheduled in parallel.
 
-```
+```c++
 37 void no_loop_unroll() {
 38 #pragma HLS function noinline
 39 int h = 0;
@@ -2481,10 +2481,9 @@ has no unrolling on the loop and `loop_unroll` unrolls the loop
 completely. This affects the resulting hardware by removing the control
 signals needed to facilitate the loop and combining multiple loop bodies
 into the same basic block, allowing more instructions to be scheduled in
-parallel. The trade-off here is an unrolled loop does not reuse hardware
-resources and can potentially use a lot of resources. However, the
-unrolled loop would finish earlier depending on how inherently parallel
-the loop body is.
+parallel. The trade-off here is that an unrolled loop does not reuse hardware
+resources and can potentially use a lot of resources; however, it will
+finish earlier depending on how inherently parallel the loop body is.
 
 ![](.//media/image3.png)To see the effects of this, open the Schedule
 Viewer and first click on the `no_loop_unroll` function shown in Figure
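
A minimal sketch of the two variants being compared, assuming SmartHLS's `#pragma HLS loop unroll` spelling; the four-element loop body is illustrative, not the tutorial's exact code:

```c++
volatile int a2[4] = {0}; // hypothetical data, declared volatile as above
volatile int e2;

void no_loop_unroll_sketch() {
    #pragma HLS function noinline
    int h = 0;
    for (int i = 0; i < 4; i++) // one shared loop body; control logic reused
        h += a2[i];
    e2 = h;
}

void loop_unroll_sketch() {
    #pragma HLS function noinline
    int h = 0;
    #pragma HLS loop unroll
    for (int i = 0; i < 4; i++) // four copies of the body in one basic block
        h += a2[i];
    e2 = h;
}
```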

Training3/readme.md

Lines changed: 1 addition & 1 deletion
@@ -2002,7 +2002,7 @@ format to write and read from the addresses that line up with addresses
 in the AXI target interface in the SmartHLS core. For burst mode, the
 processor will also write to and read from addresses corresponding to
 the DDR memory. Note, the pointers are cast as volatile to prevent the
-SoftConsole compiler from optimization away these reads and writes. The
+SoftConsole compiler from optimizing away these reads and writes. The
 Mi-V then asserts the run signal and waits until the accelerator
 de-asserts it, signaling the computing is done. The Mi-V then reads from
 the memory to issue read requests and get the results from the
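
The volatile-pointer pattern described here looks roughly like the following sketch; the base address, register offsets, and names are placeholders rather than the tutorial's actual address map:

```c
#include <stdint.h>

#define ACCEL_BASE 0x60000000UL /* placeholder base address */

void run_accelerator(void) {
    /* volatile ensures every access below is actually issued on the bus */
    volatile uint32_t *regs = (volatile uint32_t *)ACCEL_BASE;

    regs[0] = 1;               /* assert the run signal             */
    while (regs[0] != 0)       /* wait for the core to de-assert it */
        ;
    uint32_t result = regs[1]; /* read back a result word           */
    (void)result;
}
```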

Training4/readme.md

Lines changed: 3 additions & 3 deletions
@@ -385,7 +385,7 @@ array.
 
 ```c
 24 // The core logic of this example
-25 void vector_add_sw(int a, int b, int result) {
+25 void vector_add_sw(int* a, int* b, int* result) {
 26 for (int i = 0; i < SIZE; i++) {
 27 result[i] = a[i] + b[i];
 28 }
@@ -398,7 +398,7 @@ Now we look on line 70 at the `vector_add_axi_target_memcpy` top-level
 C++ function as shown in Figure 6‑12.
 
 ```c
-70 void vector_add_axi_target_memcpy(int a, int b, int result) {
+70 void vector_add_axi_target_memcpy(int* a, int* b, int* result) {
 71 #pragma HLS function top
 72 #pragma HLS interface control type(axi_target)
 73 #pragma HLS interface argument(a) type(axi_target) num_elements(SIZE)
@@ -500,7 +500,7 @@ corresponding to the C++ top-level function as shown below in Figure
 clock and a single AXI4 Target port. Due to the large number of AXI4
 ports in the RTL, SmartHLS uses a wildcard “`axi4target_*`” to
 simplify the table. The “Control AXI4 Target” indicates that
-start/finish control as done using the AXI target interface. Each of the
+start/finish control is done using the AXI target interface. Each of the
 function’s three arguments also use the AXI target interface. The
 address map of the AXI target port is given later in the report.
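
For context, the software model above typically serves as the golden reference for the hardware top-level. A hedged testbench sketch (the `main` body and the `SIZE` value are illustrative; only the two function signatures come from this diff):

```c
#include <stdio.h>

#define SIZE 1024 /* illustrative; the example defines its own SIZE */

void vector_add_sw(int* a, int* b, int* result);
void vector_add_axi_target_memcpy(int* a, int* b, int* result);

int main() {
    int a[SIZE], b[SIZE], hw[SIZE], sw[SIZE];
    for (int i = 0; i < SIZE; i++) { a[i] = i; b[i] = 2 * i; }
    vector_add_axi_target_memcpy(a, b, hw); /* HLS top-level         */
    vector_add_sw(a, b, sw);                /* software golden model */
    for (int i = 0; i < SIZE; i++)
        if (hw[i] != sw[i]) { printf("FAIL at %d\n", i); return 1; }
    printf("PASS\n");
    return 0;
}
```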

auto_instrument/Makefile

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+SRCS=main.cpp
+LOCAL_CONFIG = -legup-config=config.tcl
+HLS_INSTRUMENT_ENABLE=1
