# Design of DepthwiseConv2D for VexRISCV

*   Author: Daniel You (Google SWE intern, Summer 2020)
*   Github Profile: [danielyou0230](https://github.com/danielyou0230)
*   Last Update: August 28, 2020
*   [PR#42715](https://github.com/tensorflow/tensorflow/pull/42715) (see
    experiment results in the PR message)

## Overview

The kernel is optimized based on the reference kernel in TensorFlow Lite.
Unlike the straightforward reference implementation, this implementation takes
the TF Lite memory layout (`NHWC`) into account and leverages the memory
hierarchy to reduce the number of cache misses. More specifically, it performs
depthwise convolution for every channel at a fixed spatial position (iterating
over the `C` axis first, then the `W`, `H`, and `N` axes).

## Objective

With the debut of Artificial Intelligence (AI) products and services, our lives
have changed considerably. While many of those applications are cloud-based
implementations, there are still many cases where AI algorithms have to run on
resource-constrained devices. Current machine learning frameworks are still not
well optimized for those platforms, which prevents more complex applications
from running on them with acceptable performance.

This design focuses on improving the performance of kernels in TensorFlow Lite
Micro. More specifically, it targets one of the most popular kernels among the
models deployed on edge devices: DepthwiseConv2D (see
[TensorFlow Python API](https://www.tensorflow.org/api_docs/python/tf/keras/layers/DepthwiseConv2D);
[discussion on MobileNetV1](https://groups.google.com/g/keras-users/c/sec8pYjJwwE)
on Google Groups). The goal is to reduce inference time on those devices, which
in turn saves energy or, more importantly, enables more complex applications to
run on them.

## Background

Existing work that aims to optimize on-CPU performance focuses on leveraging
CPU-specific instructions, such as the SIMD instructions in the RISC-V Vector
extension and counterparts like AVX and SSE intrinsics. An implementation
released by Facebook
([pytorch/FBGEMM](https://github.com/pytorch/FBGEMM/tree/master/src))
demonstrated the potential of the aforementioned vector instructions.

The alternative approach is to optimize on GPUs. Modern GPUs are well known for
their performance in matrix multiplication and parallel computation (e.g. CUDA
from Nvidia). Those powerful GPUs enable machine learning researchers to
explore a wide variety of models and solve complicated problems. For
resource-constrained embedded processors, however, incorporating a GPU may not
fit the limited hardware and power budget of their applications. Unlike running
TensorFlow Python APIs on desktops or servers, TensorFlow Lite and TensorFlow
Lite Micro are made to run inference efficiently on those devices, making it
possible for machine learning applications to become ubiquitous in our lives.

## Requirements and scale

After a detailed analysis of memory access patterns in the existing
implementation, I found that the existing code under-utilizes the memory
hierarchy, specifically the SRAM cache, and therefore pays excessive memory
access time; an access that misses the cache is roughly 100 times slower than
one that hits it
([Latency Numbers Every Programmer Should Know](https://gist.github.com/jboner/2841832),
[The Effect Of CPU Caches And Memory Access Patterns](http://kejser.org/the-effect-of-cpu-caches-and-memory-access-patterns/)).
Therefore, this design aims to improve the memory access pattern to better fit
the memory layout of the TensorFlow Lite C++ library. Any integer-based model
with DepthwiseConv2D layers running on TensorFlow Lite Micro will benefit from
this change.

To begin with, the memory layout of tensors in the TensorFlow Lite C++ library
uses the `NHWC` format `(n, height, width, channel)`, flattened to a 1-D
tensor; the index of element `(n, h, w, c)` can then be calculated as
`((n * H + h) * W + w) * C + c`. The reference implementation is depicted as
follows:

```
for i-th input among N inputs
  for c-th input channel
    for (y, x) in input that are convolving with the filter
      access element (i, y, x, c) in the input
```

Thus, if the current element is `(i, y, x, c)` at index
`((i * H + y) * W + x) * C + c`, the next element will be `(i, y, x + 1, c)` at
index `((i * H + y) * W + (x + 1)) * C + c`. The difference between the indices
of two consecutive accesses is therefore `C` (illustrated below), which is
clearly not a sequential access pattern.

![dconv_arr_index](https://user-images.githubusercontent.com/21079720/91612344-0f25e600-e932-11ea-9278-7c8161748711.png)
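
For concreteness, below is a minimal C++ sketch of the flattened `NHWC`
indexing described above; `FlatIndex` is an illustrative helper name rather
than the actual TF Lite routine, and the shape numbers are made up.

```
#include <cstdint>

// Hypothetical helper: flat index of element (n, h, w, c) in an NHWC tensor
// of shape [N, H, W, C].
inline int FlatIndex(int n, int h, int w, int c, int H, int W, int C) {
  return ((n * H + h) * W + w) * C + c;
}

// In the reference loop order the channel is fixed while (y, x) advances, so
// two consecutive accesses are C elements apart in the flattened int8 buffer.
// For example, with H = W = 4 and C = 8:
//   FlatIndex(0, 1, 1, 0, 4, 4, 8) == 40
//   FlatIndex(0, 1, 2, 0, 4, 4, 8) == 48   // jump of C = 8
```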

In response to the poor memory access pattern in the reference, it would be
beneficial to implement DepthwiseConv2D in a depth-centric manner, namely,
accessing the elements of every channel at a fixed spatial location `(y, x)`.
The access order then becomes sequential on the 1-D tensor because the layout
of the tensors is `NHWC`.

## Design ideas

Instead of accessing the memory in a non-sequential manner, this design proposes
to change the access pattern to be consistent with the memory layout in the
current TensorFlow Lite C++ library. The idea can be broken down into two major
parts:

*   Relating sequential memory access to DepthwiseConv2D
*   Depthwise convolution with sequential memory access scheme

### Relating sequential memory access to DepthwiseConv2D
Contrary to the reference implementation, the proposed solution re-orders the
calculation to access the elements of the tensor sequentially, namely in the
order `(0, 1, 2, ..., H * W * C - 1)`. This can be done by interchanging the
order of the two inner loops:

```
for i-th input among N inputs
  for (y, x) in input that are convolving with the filter
    for c-th input channel
      access element (i, y, x, c) in the input
```

In this case, if the current element is `(i, y, x, c)` at index
`((i * H + y) * W + x) * C + c`, the next element will be at index
`((i * H + y) * W + x) * C + (c + 1)`; the difference between two consecutive
accesses becomes `1`, thereby fully re-using the data in a cache block.
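
As a sketch (restating the hypothetical `FlatIndex` helper from the earlier
example), the interchanged loop order walks the flattened input with unit
stride:

```
#include <cstdint>

inline int FlatIndex(int n, int h, int w, int c, int H, int W, int C) {
  return ((n * H + h) * W + w) * C + c;
}

// Channel-innermost traversal: consecutive iterations touch consecutive flat
// indices, so every byte of a fetched cache line is consumed before the next
// line has to be loaded. The convolution arithmetic itself is elided.
void WalkSequentially(const int8_t* input, int H, int W, int C) {
  for (int y = 0; y < H; ++y) {
    for (int x = 0; x < W; ++x) {
      for (int c = 0; c < C; ++c) {
        int8_t value = input[FlatIndex(0, y, x, c, H, W, C)];
        (void)value;  // multiply-accumulate with the filter would go here
      }
    }
  }
}
```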

### Depthwise convolution with sequential memory access scheme

In the existing TF Lite reference implementation, each element of the output is
calculated by performing its `(filter_h * filter_w)` multiplications and
additions back to back. With the proposed design, the memory access pattern can
be greatly improved by re-ordering these calculations.

Rather than finishing each output element in one go, this design rearranges the
operations to calculate the output at a specific spatial location for all
channels (see the colored cells in the output tensor in the figure below). The
resulting order of calculations is illustrated below; the input/filter
locations involved are represented as `(spatial index, channel)`.

![dconv_org_vis](https://user-images.githubusercontent.com/21079720/91612427-409eb180-e932-11ea-9c30-a205c8f3e461.png)

The calculation of each output element is completed when the order reaches the
bold coordinates in the comparison table below (shown after the following
sketch). As the table shows, this scheme only holds partial results until it
reaches the last filter location (i.e., `(#9, 0)` to `(#9, C-1)`). Ideally, we
could use the output tensor directly as the accumulator so that no extra space
is needed at runtime. However, since the output tensor of an integer model only
holds 8-bit values, accumulating intermediate values in the output tensor would
overflow: the product of two `int8` values is in the range of `int16`, and
there are `filter_h * filter_w` such products to accumulate per output element,
so the value before requantization can be as large as
`filter_h * filter_w * MAX_INT16`. An `int32` accumulator is therefore adequate
as long as the number of accumulations per output element
(`filter_h * filter_w`) does not exceed `2^16`. To avoid overflowing the output
tensor while still providing a better memory access pattern, an `int32` array
with one accumulator per channel (`C` entries) is enough: those `C` results are
finished once the set of filter locations (`#1` to `#9`) has been convolved, so
we do not have to allocate an array as large as the output tensor to accumulate
the values.
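
As a quick sanity check on the overflow bound, here is a sketch with
illustrative numbers, assuming a 3x3 filter:

```
#include <cstdint>

// Each product of two int8 values has magnitude at most 128 * 128 = 16384,
// which fits in int16. Accumulating filter_h * filter_w = 9 of them gives at
// most 9 * 16384 = 147456, far below INT32_MAX (2147483647), so one int32
// accumulator per channel is sufficient; bias addition and requantization
// happen afterwards.
static_assert(9 * 128 * 128 <= INT32_MAX,
              "3x3 int8 accumulation fits comfortably in int32");
```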

Original        | Optimized
:-------------: | :-------------:
(#1, 0)         | (#1, 0)
(#2, 0)         | (#1, 1)
...             | ...
**(#9, 0)**     | (#1, C - 1)
(#1, 1)         | (#2, 0)
...             | ...
**(#9, 1)**     | **(#9, 0)**
...             | ...
**(#9, C - 1)** | **(#9, C - 1)**

If we implement this idea, i.e. allocating a temporary array of size `C`, we
can follow the loop structure shown below. This would work just fine, but as we
can see in the routine, it involves allocating an `int32` array whose **size is
proportional to the number of input channels**, which is not preferable on
resource-limited devices because we cannot guarantee there will always be
enough memory for every application or model.

```
for i-th input among N inputs
  for each (out_y, out_x)
    for m < depth_multiplier; step_size = 1
      calculate origin (in_y_origin, in_x_origin) to perform convolution

      // Accumulate partial results in buffer given an origin
      create an int32 buffer of size output_channel as accumulators

      for each (filter_y, filter_x)
        calculate (in_y, in_x) to perform convolution
        for in_ch < in_channel; step_size = 1
          calculate out_ch
          // accumulate partial results
          buffer[out_ch] += input[indexOf(i, in_y, in_x, in_ch)] *
                            filter[indexOf(0, filter_y, filter_x, out_ch)]

      for in_ch < in_channel; step_size = 1
        calculate out_ch
        // Add bias / activation / requantize
        value = postAccumulation(buffer[out_ch])
        output[indexOf(i, out_y, out_x, out_ch)] = value
```

Instead, we can further break the structure down into chunks: we add an
additional nested loop that iterates over `K` channels at a time until all
channels are processed. The modified loop structure is depicted below and a
visualization is shown in the figure after it.

```
for i-th input among N inputs
  for each (out_y, out_x)
    for m < depth_multiplier; step_size = 1
      calculate origin (in_y_origin, in_x_origin) to perform convolution

      // Accumulate partial results in buffer for K channels given an origin
      for ch < input_ch; step_size = K
        channel_step = min(K, input_ch - ch)
        create an int32 buffer of size K as accumulators for the current chunk

        for each (filter_y, filter_x)
          calculate (in_y, in_x) to perform convolution
          for ch_offset < channel_step; step_size = 1
            calculate in_ch and out_ch
            // accumulate partial results
            buffer[ch_offset] += input[indexOf(i, in_y, in_x, in_ch)] *
                                 filter[indexOf(0, filter_y, filter_x, out_ch)]

        for ch_offset < channel_step; step_size = 1
          calculate out_ch
          // Add bias / activation / requantize
          value = postAccumulation(buffer[ch_offset])
          output[indexOf(i, out_y, out_x, out_ch)] = value
```

![dconv_design_vis](https://user-images.githubusercontent.com/21079720/91612374-2369e300-e932-11ea-90eb-898c0270794e.png)
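
For illustration, the following is a condensed C++ sketch of the chunked
accumulation above. It is not the actual kernel code: padding, stride,
dilation, `depth_multiplier`, and the full requantization math are omitted, and
all names (`FlatIndex`, `DepthwiseAtLocation`, `kChannelChunk`) are made up for
this sketch.

```
#include <algorithm>
#include <cstdint>

constexpr int kChannelChunk = 32;  // K: partial results kept per chunk

// Flat index into an NHWC tensor of shape [N, H, W, C].
inline int FlatIndex(int n, int h, int w, int c, int H, int W, int C) {
  return ((n * H + h) * W + w) * C + c;
}

// Depthwise convolution for one output location (out_y, out_x) of batch 0,
// assuming depth_multiplier == 1, stride 1, and no padding, so the input
// origin equals the output position. Accumulates K channels at a time.
void DepthwiseAtLocation(const int8_t* input, const int8_t* filter,
                         const int32_t* bias, int8_t* output,
                         int in_h, int in_w, int channels,
                         int filter_h, int filter_w,
                         int out_h, int out_w, int out_y, int out_x) {
  for (int ch = 0; ch < channels; ch += kChannelChunk) {
    const int channel_step = std::min(kChannelChunk, channels - ch);
    int32_t buffer[kChannelChunk] = {0};  // per-chunk accumulators

    for (int fy = 0; fy < filter_h; ++fy) {
      for (int fx = 0; fx < filter_w; ++fx) {
        const int in_y = out_y + fy;
        const int in_x = out_x + fx;
        // Innermost loop walks channels, so the input and filter accesses
        // are sequential (unit stride) in the NHWC layout.
        for (int c = 0; c < channel_step; ++c) {
          buffer[c] += static_cast<int32_t>(input[FlatIndex(
                           0, in_y, in_x, ch + c, in_h, in_w, channels)]) *
                       filter[FlatIndex(0, fy, fx, ch + c, filter_h, filter_w,
                                        channels)];
        }
      }
    }

    // Post-accumulation: add bias, then clamp back to int8 (per-channel
    // requantization multipliers/shifts are omitted here).
    for (int c = 0; c < channel_step; ++c) {
      int32_t acc = buffer[c] + bias[ch + c];
      acc = std::max<int32_t>(-128, std::min<int32_t>(127, acc));
      output[FlatIndex(0, out_y, out_x, ch + c, out_h, out_w, channels)] =
          static_cast<int8_t>(acc);
    }
  }
}
```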

The final question is the choice of `K`. According to the soft-CPU
configuration, we have a cache size of 4 KB and each cache line is 32 bytes.
Since the input format is `int8`, whenever a line of the input tensor is
fetched, 32 `int8` values are loaded into the cache. To fully utilize that
line, we can size the buffer to hold 32 partial results (128 bytes, or 4 cache
lines). Because most models keep the number of channels a power of two (except
for the input layer), 32 is a reasonable value for performing depthwise
convolution with both small and large channel counts.
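
The cache-line arithmetic behind that choice, as a small sketch (the 32-byte
line size and 4 KB cache size come from the soft-CPU configuration mentioned
above; the constant names are illustrative):

```
#include <cstdint>

// One fetched cache line holds 32 int8 input values, and the matching 32
// int32 partial results occupy 32 * 4 = 128 bytes, i.e. 4 cache lines, a
// small fraction of the 4 KB data cache.
constexpr int kCacheLineBytes = 32;
constexpr int kChannelChunk = kCacheLineBytes / static_cast<int>(sizeof(int8_t));
constexpr int kAccumulatorBytes = kChannelChunk * static_cast<int>(sizeof(int32_t));

static_assert(kChannelChunk == 32, "one cache line of int8 inputs per chunk");
static_assert(kAccumulatorBytes == 4 * kCacheLineBytes,
              "the chunk's accumulators span 4 cache lines");
```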

## Alternatives considered

An alternative design is to dynamically allocate a buffer covering all channels
(an `int32` array whose size equals the number of output channels). This
approach is easier to implement since, after the `filter_h * filter_w * C`
multiply-accumulates for an output location, we can requantize those `C` values
and store them into the output tensor. However, since we are running on
memory-constrained devices, dynamic allocation is discouraged by the upstream
developers.