@@ -783,7 +783,7 @@ int main() {
## Verification: Co-simulation of Multi-threaded SmartHLS Code
- As mentioned before the ` producer_consumer ` project cannot be simulated
+ As mentioned before, the `producer_consumer` project cannot be simulated
with co-simulation. This is because the `producer_consumer` project has
threads that run forever and do not finish before the top-level function
returns. SmartHLS co-simulation supports a single call to the top-level
@@ -1127,11 +1127,11 @@ is defined by the number of rows, columns, and the depth.
Convolution layers are good for extracting geometric features from an
input tensor. Convolution layers work in the same way as an image
processing filter (such as the Sobel filter) where a square filter
- (called a ** kernel** ) is slid across an input image. The ** size** of
- filter is equal to the side length of the square filter , and the size of
- the step when sliding the filter is called the ** stride** . The values of
- the input tensor under the kernel (called the ** window** ) and the values
- of the kernel are multiplied and summed at each step, which is also
+ (called a **kernel**) is slid across an input image. The **size** of a
+ filter is equal to its side length, and the size of the step when sliding
+ the filter is called the **stride**. The values of the input tensor
+ under the kernel (called the **window**) and the values of the
+ kernel are multiplied and summed at each step, which is also
called a convolution. Figure 13 shows an example of a convolution layer
processing an input tensor with a depth of 1.
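To make the **kernel**, **stride**, and **window** terms concrete, the sketch below computes one output value of a depth-1 convolution by multiplying the window under the kernel with the kernel weights and summing. The function name, sizes, and types here are ours for illustration and are not taken from the tutorial's model.

```c++
#include <cstddef>

// Illustrative only: compute one output element of a 2-D convolution for a
// depth-1 input. The "window" is the KERNEL_SIZE x KERNEL_SIZE patch of the
// input currently under the kernel; sliding by `stride` selects that patch.
template <std::size_t IN_SIZE, std::size_t KERNEL_SIZE>
int convolve_one_output(const int input[IN_SIZE][IN_SIZE],
                        const int kernel[KERNEL_SIZE][KERNEL_SIZE],
                        std::size_t out_row, std::size_t out_col,
                        std::size_t stride) {
    int sum = 0;
    for (std::size_t i = 0; i < KERNEL_SIZE; i++)
        for (std::size_t j = 0; j < KERNEL_SIZE; j++)
            sum += input[out_row * stride + i][out_col * stride + j] * kernel[i][j];
    return sum;  // one convolution result for this window position
}
```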
@@ -1580,16 +1580,16 @@ we show the input tensor values and convolution filters involved in the
computation of the set of colored output tensor values (see Loop 3
arrow).
- Loop 1 and Loop 2 the code traverses along the row and column dimensions
+ For Loop 1 and Loop 2, the code traverses along the row and column dimensions
of the output tensor. Loop 3 traverses along the depth dimension of the
- output tensor, each iteration computes a ` PARALLEL_KERNELS ` number of
+ output tensor, and each iteration computes a total of `PARALLEL_KERNELS`
outputs. The `accumulated_value` array will hold the partial
dot-products. Loop 4 traverses along the row and column dimensions of
- the input tensor and convolution filter kernels. Then Loop 5 walks
- through each of the ` PARALLEL_KERNELS ` number of selected convolution
+ the input tensor and convolution filter kernels. Then, Loop 5 walks
+ through each of the `PARALLEL_KERNELS` selected convolution
filters and Loop 6 traverses along the depth dimension of the input
tensor. Loop 7 and Loop 8 add up the partial sums together with biases
- to produce ` PARALLEL_KERNEL ` number of outputs.
+ to produce `PARALLEL_KERNELS` outputs.
``` C
const static unsigned PARALLEL_KERNELS = NUM_MACC / INPUT_DEPTH;
@@ -2203,7 +2203,7 @@ instructions that always run together with a single entry point at the
beginning and a single exit point at the end. A basic block in LLVM IR
always has a label at the beginning and a branching instruction at the
end (br, ret, etc.). An example of LLVM IR is shown below, where the
- ` body.0 ` basic block performs an addition (add) and subtraction (sub) and
+ `body.0` basic block performs an addition (add) and subtraction (sub), and
then branches unconditionally (br) to another basic block labeled
`body.1`. Control flow occurs between basic blocks.
@@ -2231,7 +2231,7 @@ button () to build the design and generate the
schedule.
We can ignore the `printWarningMessageForGlobalArrayReset` warning message
- for global variable a in this example as described in the producer
+ for global variable `a` in this example as described in the producer
consumer example in the [section 'Producer Consumer Example'](#producer-consumer-example).
The first example we will look at is the `no_dependency` example on line
@@ -2240,7 +2240,7 @@ The first example we will look at is the `no_dependency` example on line
<p align="center"><img src=".//media/image19.png" /></p>
- ```
+ ``` c++
8 void no_dependency () {
9 #pragma HLS function noinline
10 e = b + c;
@@ -2252,10 +2252,10 @@ The first example we will look at is the `no_dependency` example on line
<p align="center">Figure 28: Source code and data dependency graph for no_dependency
function.</p>
- In this example, values are loaded from b, c , and d and additions happen
- before storing to * e * , * f * , and * g * . None of the adds use results from
+ In this example, values are loaded from `b`, `c`, and `d`, and additions happen
+ before storing to `e`, `f`, and `g`. None of the adds use results from
the previous adds and thus all three adds can happen in parallel. The
- * noinline* pragma is used to prevent SmartHLS from automatically
+ `noinline` pragma is used to prevent SmartHLS from automatically
inlining this small function and making it harder for us to understand
the schedule. Inlining is when the instructions in the called function
get copied into the caller, to remove the overhead of the function call
@@ -2290,17 +2290,17 @@ the store instruction highlighted in yellow depends on the result of the
add instruction as we expect.
We have declared all the variables used in this function as
- ** volatile** . The volatile C/C++ keyword specifies that the variable can
+ **volatile**. The `volatile` C/C++ keyword specifies that the variable can
be updated by something other than the program itself, making sure that
any operations with these variables do not get optimized away by the
compiler as every operation matters. An example of where the compiler
handles this incorrectly is seen in the [section 'Producer Consumer Example'](#producer-consumer-example), where we had to
declare a synchronization signal between two threaded functions as
- volatile. Using volatile is required for toy examples to make sure each
+ `volatile`. Using `volatile` is required for toy examples to make sure each
operation we perform with these variables will be generated in hardware
and viewable in the Schedule Viewer.
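As an aside, the hypothetical sketch below (our own illustration, not code from `instruction_level_parallelism.cpp`) shows the difference volatile makes: without it the compiler is free to fold or delete the toy arithmetic, leaving nothing to observe in the Schedule Viewer.

```c++
// Hypothetical illustration, not part of the tutorial's source files.
int          pb = 0, pc = 0, pe;   // plain ints: values are known at compile time
volatile int vb = 0, vc = 0, ve;   // volatile ints: every access must be preserved

void may_be_optimized_away() {
    // The compiler may fold this to a constant store, or drop it entirely,
    // so no add operation would appear in the schedule.
    pe = pb + pc;
}

void kept_in_hardware() {
    // The two loads, the add, and the store must all be kept, so they show
    // up as scheduled operations in the Schedule Viewer.
    ve = vb + vc;
}
```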
- ```
+ ```c++
4 volatile int a[5] = {0};
5 volatile int b = 0, c = 0, d = 0;
6 volatile int e, f, g;
@@ -2315,7 +2315,7 @@ code and SmartHLS cannot schedule all instructions in the first cycle.
<p align="center"><img src=".//media/image68.png" /></p>
- ```
+ ``` c++
15 void data_dependency () {
16 #pragma HLS function noinline
17 e = b + c;
@@ -2337,8 +2337,8 @@ second add is also used in the third add. These are examples of data
dependencies as later adds use the data result of previous adds. Because
we must wait for the result `e` to be produced before we can compute `f`,
and then the result `f` must be produced before we can compute `g`, not all
- instructions can be scheduled immediately. They must wait for their
- dependent instructions to finish executing before they can start, or
+ instructions can be scheduled immediately. They must wait for the instructions
+ they depend on to finish executing before they can start, or
they would produce the wrong result.
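Schematically, the chain forces the adds into successive cycles, roughly as in the sketch below. The operands and cycle comments are illustrative (reusing the volatile globals declared on lines 4-6 of the example file), not the exact `data_dependency` body or its real schedule.

```c++
extern volatile int b, c, d, e, f, g;   // the globals declared earlier

void chained_adds() {   // illustrative sketch only
    e = b + c;          // cycle 1: no dependencies, can start immediately
    f = e + d;          // cycle 2: must wait for the result e
    g = f + e;          // cycle 3: must wait for the result f
}
```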
<p align="center"><img src=".//media/image70.png" /></br>
@@ -2375,7 +2375,7 @@ memories.
<p align="center"><img src=".//media/image72.png" /></p>
- ```
+ ```c++
22 void memory_dependency() {
23 #pragma HLS function noinline
24 volatile int i = 0;
@@ -2419,7 +2419,7 @@ resource cannot be scheduled in parallel due to a lack of resources.
`resource_contention` function on line 30 of
`instruction_level_parallelism.cpp`.
- ```
+ ``` c++
30 void resource_contention () {
31 #pragma HLS function noinline
32 e = a[ 0] ;
@@ -2452,7 +2452,7 @@ when generating the schedule for a design.
Next, we will see an example of how loops prevent operations from being
scheduled in parallel.
- ```
+ ```c++
37 void no_loop_unroll() {
38 #pragma HLS function noinline
39 int h = 0;
@@ -2481,10 +2481,9 @@ has no unrolling on the loop and `loop_unroll` unrolls the loop
completely. This affects the resulting hardware by removing the control
signals needed to facilitate the loop and combining multiple loop bodies
into the same basic block, allowing more instructions to be scheduled in
- parallel. The trade-off here is an unrolled loop does not reuse hardware
- resources and can potentially use a lot of resources. However, the
- unrolled loop would finish earlier depending on how inherently parallel
- the loop body is.
+ parallel. The trade-off here is that an unrolled loop does not reuse hardware
+ and can therefore use far more resources; however, it will
+ finish earlier, depending on how inherently parallel the loop body is.
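For comparison, a fully unrolled variant might look like the sketch below. This is our own sketch, reusing the volatile globals `a` and `g` from the example file; the `#pragma HLS loop unroll` spelling is an assumption based on the SmartHLS pragma style, and the body is not copied from `instruction_level_parallelism.cpp`.

```c++
extern volatile int a[5];   // globals from lines 4 and 6 of the example file
extern volatile int g;

void loop_unroll_sketch() {
    #pragma HLS function noinline
    int h = 0;
    // Assumed pragma spelling; check the SmartHLS pragma reference for the
    // exact syntax and the optional unroll factor.
    #pragma HLS loop unroll
    for (int i = 0; i < 5; i++) {
        h += a[i];   // with the loop fully unrolled, the five additions land in
    }                // one basic block and expose more instruction-level
    g = h;           // parallelism to the scheduler
}
```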
![](.//media/image3.png) To see the effects of this, open the Schedule
Viewer and first click on the ` no_loop_unroll ` function shown in Figure