# Design of DepthwiseConv2D for VexRISCV

* Author: Daniel You (Google SWE intern, Summer 2020)
* GitHub Profile: [danielyou0230](https://github.com/danielyou0230)
* Last Update: August 28, 2020
* [PR#42715](https://github.com/tensorflow/tensorflow/pull/42715) (see
  experiment results in the PR message)

## Overview

The kernel is optimized based on the reference kernel in TensorFlow Lite.
Unlike the straightforward reference implementation, this implementation takes
the TF Lite memory layout (`NHWC`) into account and leverages the memory
hierarchy to reduce the cache miss count. More specifically, it performs
depthwise convolution for every channel at a fixed spatial position (iterating
over the `C`-axis first, then the `W`-axis, `H`-axis, and `N`-axis).

## Objective

Artificial Intelligence (AI) products and services have changed our lives since
their debut. While many of those applications are cloud-based, there are still
many cases where AI algorithms have to run on resource-constrained devices.
Current machine learning frameworks are still not well optimized for those
platforms, which prevents more complicated applications from running on them
with acceptable performance.

This design focuses on improving the performance of kernels in TensorFlow Lite
Micro. More specifically, it targets one of the most popular kernels among the
models deployed on edge devices: DepthwiseConv2D (see the
[TensorFlow Python API](https://www.tensorflow.org/api_docs/python/tf/keras/layers/DepthwiseConv2D)
and a [discussion on MobileNetV1](https://groups.google.com/g/keras-users/c/sec8pYjJwwE)
on Google Groups). The goal is to reduce inference time on those devices, which
in turn saves energy or, more importantly, enables more complex applications to
run on them.

## Background

Existing work that aims to optimize on-CPU performance focuses on leveraging
CPU-specific instructions, such as the SIMD instructions in the RISC-V Vector
extension and counterparts like AVX and SSE intrinsics. An implementation
released by Facebook
([pytorch/FBGEMM](https://github.com/pytorch/FBGEMM/tree/master/src))
demonstrated the potential of the aforementioned vector instructions.

The alternative approach is to optimize on GPUs. Modern GPUs are well known for
their performance in matrix multiplication and parallel computation (e.g., CUDA
from Nvidia), and they enable machine learning researchers to explore a wide
variety of models and solve complicated problems. For resource-constrained
embedded processors, however, incorporating a GPU may not fit the limited
hardware and power budget of the application. Unlike running TensorFlow Python
APIs on desktops or servers, TensorFlow Lite and TensorFlow Lite Micro are made
to run inference efficiently on those devices, which makes it possible for
machine learning applications to become ubiquitous in our lives.

## Requirements and scale

After a detailed analysis of the memory access patterns in existing
implementations, I found that the existing code under-utilizes the memory
hierarchy, specifically the SRAM cache: accesses that miss the cache and go to
main memory are roughly 100 times slower than cache hits
([Latency Numbers Every Programmer Should Know](https://gist.github.com/jboner/2841832),
[The Effect Of CPU Caches And Memory Access Patterns](http://kejser.org/the-effect-of-cpu-caches-and-memory-access-patterns/)).
Therefore, this design aims to improve the memory access pattern to better fit
the memory layout of the TensorFlow Lite C++ library. Any integer-based model
with DepthwiseConv2D layers running on TensorFlow Lite Micro will benefit from
this change.

To begin with, tensors in the TensorFlow Lite C++ library use the `NHWC` format
`(n, height, width, channel)` and are flattened into a 1-D array, so the index
of element `(n, h, w, c)` can be calculated as `((n * H + h) * W + w) * C + c`.
The reference implementation is depicted as follows:

```
for i-th input among N inputs
  for c-th input channel
    for (y, x) in input that are convolving with the filter
      access element (i, y, x, c) in the input
```

Thus, if the current element is `(i, y, x, c)` at index `((i * H + y) * W + x) *
C + c`, the next element will be `(i, y, x + 1, c)` at index `((i * H + y) * W +
(x + 1)) * C + c`. The difference between the indices of two consecutive
accesses is therefore `C` (illustrated below), which is clearly not sequential
access.

![dconv_arr_index](https://user-images.githubusercontent.com/21079720/91612344-0f25e600-e932-11ea-9278-7c8161748711.png)

In response to the poor memory access pattern of the reference, it is beneficial
to implement DepthwiseConv2D in a depth-centric manner, namely, accessing the
elements at a fixed spatial location `(y, x)` for each channel. The access order
then becomes sequential on the 1-D array because tensors are laid out in the
`NHWC` format.

## Design ideas

Instead of accessing memory in a non-sequential manner, this design proposes to
change the access pattern to be consistent with the memory layout in the current
TensorFlow Lite C++ library. The idea can be broken down into two major parts:

* Relating sequential memory access to DepthwiseConv2D
* Depthwise convolution with sequential memory access scheme

### Relating sequential memory access to DepthwiseConv2D

Contrary to the reference implementation, the proposed solution re-orders the
calculation to access the elements of the tensor sequentially, namely, `(0, 1,
2, ..., H * W * C - 1)`. This can be done by interchanging the order of the two
inner loops:

```
for i-th input among N inputs
  for (y, x) in input that are convolving with the filter
    for c-th input channel
      access element (i, y, x, c) in the input
```

In this case, if the current element is `(i, y, x, c)` at index `((i * H + y) *
W + x) * C + c`, the next element will be `(i, y, x, c + 1)` at index `((i * H +
y) * W + x) * C + (c + 1)`. The difference between two consecutive accesses
becomes `1`, thereby fully re-using the data in a cache block.
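
As a concrete illustration of the two access orders, the minimal sketch below
(hypothetical code written for this document, not the kernel itself; the
`FlatIndex` helper and the shape constants are made up) computes the index
distance between two consecutive inner-loop accesses in each ordering:

```
#include <cstdio>

// Flattened NHWC offset, matching ((n * H + h) * W + w) * C + c from the text.
inline int FlatIndex(int n, int h, int w, int c, int H, int W, int C) {
  return ((n * H + h) * W + w) * C + c;
}

int main() {
  // Hypothetical tensor shape, chosen only for illustration.
  const int H = 4, W = 4, C = 8;

  // Reference order: channel in an outer loop, so stepping from (y, x) to
  // (y, x + 1) at a fixed channel jumps C elements in the flattened array.
  int reference_stride =
      FlatIndex(0, 0, 1, 0, H, W, C) - FlatIndex(0, 0, 0, 0, H, W, C);

  // Interchanged order: channel in the innermost loop, so stepping from
  // channel c to c + 1 at a fixed (y, x) advances the index by exactly 1.
  int interchanged_stride =
      FlatIndex(0, 0, 0, 1, H, W, C) - FlatIndex(0, 0, 0, 0, H, W, C);

  std::printf("reference stride: %d, interchanged stride: %d\n",
              reference_stride, interchanged_stride);  // prints 8 and 1
  return 0;
}
```
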
### Depthwise convolution with sequential memory access scheme

In the existing TF Lite reference implementation, each element of the output is
calculated by performing `(filter_h * filter_w)` multiplications and additions
in a row. With the proposed design, the memory access pattern can be greatly
improved by re-ordering the calculations.

Rather than calculating the results in a row, this design rearranges the
operations. To calculate the output at a specific spatial location for all
channels (see the colored cells in the output tensor in the figure below), the
resulting order of calculations is illustrated in the figure and table below;
the input/filter locations involved are represented as `(spatial index,
channel)`.

![dconv_org_vis](https://user-images.githubusercontent.com/21079720/91612427-409eb180-e932-11ea-9c30-a205c8f3e461.png)

The calculation of an output element is complete when it reaches the bold
coordinates in the table. As the table shows, this scheme only holds partial
results until it reaches the last spatial location (i.e., `(#9, 0)` to
`(#9, C - 1)`). Ideally, we could use the output tensor directly as the
accumulator, so no extra space would be needed at runtime. However, since the
output tensor of an integer model is limited to 8 bits, accumulating
intermediate values in the output tensor would overflow: the product of two
`int8` values lies in the `int16` range, and there are `filter_h * filter_w`
such values to accumulate, so the magnitude before requantization can reach
`filter_h * filter_w * MAX_INT16`. An `int32` accumulator is therefore adequate
as long as the number of accumulations (`filter_h * filter_w`) does not exceed
`2^16`. To avoid overflowing the output tensor while still providing a better
memory access pattern, an `int32` array with one accumulator per channel (`C`
entries) is enough: those `C` results are finished once the set of spatial
locations (`#1` to `#9`) has been convolved, so we do not have to allocate an
array as large as the output tensor to accumulate the values.

Original        | Optimized
:-------------: | :-------------:
(#1, 0)         | (#1, 0)
(#2, 0)         | (#1, 1)
...             | ...
**(#9, 0)**     | (#1, C - 1)
(#1, 1)         | (#2, 0)
...             | ...
**(#9, 1)**     | **(#9, 0)**
...             | ...
**(#9, C - 1)** | **(#9, C - 1)**
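
To make the overflow argument concrete, here is a small worked check (an
illustrative sketch for this document, assuming a 3x3 filter to match the nine
locations `#1` to `#9`; it is not part of the kernel):

```
#include <cstdint>
#include <limits>

// Worst-case magnitude of a single int8 x int8 product: 128 * 128 = 16384.
constexpr int64_t kMaxProduct = 128 * 128;

// A 3x3 filter contributes nine products (#1 to #9) to each accumulator.
constexpr int64_t kAccumulations = 3 * 3;
constexpr int64_t kWorstCaseSum = kAccumulations * kMaxProduct;  // 147456

// The running sum already exceeds the int16 (and int8) range ...
static_assert(kWorstCaseSum > std::numeric_limits<int16_t>::max(),
              "worst-case sum overflows int16");
// ... but stays far below the int32 limit (about 2.1e9), so an int32
// accumulator is wide enough.
static_assert(kWorstCaseSum < std::numeric_limits<int32_t>::max(),
              "worst-case sum fits in int32");
```

A 3x3 filter performs only nine accumulations per output element, far below the
`2^16` accumulation bound derived above.
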
If we implement this idea, i.e., allocate a temporary array of size `C`, we can
follow the loop structure shown below. This works, but as we can see in the
routine, it involves allocating an `int32` array whose **size is proportional to
the number of input channels**, which is not preferable on resource-limited
devices because we cannot guarantee there will always be enough memory for every
application or model.

```
for i-th input among N inputs
  for each (out_y, out_x)
    for m < depth_multiplier; step_size = 1
      calculate origin (in_y_origin, in_x_origin) to perform convolution

      // Accumulate partial results in buffer given an origin
      create an int32 buffer of size output_channel as accumulators

      for each (filter_y, filter_x)
        calculate (in_y, in_x) to perform convolution
        for in_ch < in_channel; step_size = 1
          calculate out_ch
          // accumulate partial results
          buffer[out_ch] += input[indexOf(i, in_y, in_x, in_ch)] *
                            filter[indexOf(0, filter_y, filter_x, out_ch)]

      for in_ch < in_channel; step_size = 1
        calculate out_ch
        // Add bias / activation / requantize
        value = postAccumulation(buffer[out_ch])
        output[indexOf(i, out_y, out_x, out_ch)] = value
```

Instead, we can further break the structure down into chunks: an additional
nested loop iterates over `K` channels at a time until all channels are
processed. The modified loop structure is depicted below, and the visualization
is shown in the figure after the loop.

```
for i-th input among N inputs
  for each (out_y, out_x)
    for m < depth_multiplier; step_size = 1
      calculate origin (in_y_origin, in_x_origin) to perform convolution

      // Accumulate partial results in buffer for K channels given an origin
      for ch < in_channel; step_size = K
        // channel_step = min(K, in_channel - ch)
        create an int32 buffer of size K as accumulators for the current chunk

        for each (filter_y, filter_x)
          calculate (in_y, in_x) to perform convolution
          for ch_offset < channel_step; step_size = 1
            calculate in_ch and out_ch
            // accumulate partial results
            buffer[ch_offset] += input[indexOf(i, in_y, in_x, in_ch)] *
                                 filter[indexOf(0, filter_y, filter_x, out_ch)]

        for ch_offset < channel_step; step_size = 1
          calculate out_ch
          // Add bias / activation / requantize
          value = postAccumulation(buffer[ch_offset])
          output[indexOf(i, out_y, out_x, out_ch)] = value
```

![dconv_design_vis](https://user-images.githubusercontent.com/21079720/91612374-2369e300-e932-11ea-90eb-898c0270794e.png)

The final question is the choice of `K`. According to the soft-CPU
configuration, we have a cache size of 4 KB and each memory block is 32 bytes.
Combined with the input format we use (`int8`), whenever a block of the input
tensor is fetched into the cache, 32 `int8` values are loaded at once. To fully
utilize that block, we can size the accumulator buffer to hold 32 partial
results (128 bytes, or 4 blocks). Since most applications keep the number of
channels a power of two (except for the input), 32 is a reasonable value for
performing depthwise convolution with both small and large numbers of channels
in a model.

## Alternatives considered

An alternative design is to dynamically allocate one accumulator per output
channel (an `int32` array whose size equals the number of output channels).
This approach is easier to implement, since after `filter_h * filter_w * C`
multiply-accumulate operations we can requantize those `C` values and store
them in the output tensor. However, we are running on memory-constrained
devices, and dynamic allocation is discouraged by the upstream developers.
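
For completeness, the following is one possible C++ rendering of the chunked
scheme above. It is an illustrative sketch rather than the submitted kernel: it
assumes a single input (`N = 1`), `depth_multiplier = 1`, stride 1, no padding
or dilation, and it collapses the bias/activation/requantization step into a
hypothetical `Requantize()` placeholder.

```
#include <algorithm>
#include <cstdint>

// K: accumulators processed per chunk; 32 int32 values occupy 128 bytes,
// i.e. four 32-byte cache blocks on the soft-CPU configuration above.
constexpr int kChannelChunk = 32;

// Placeholder for the bias / activation / requantization step; the real
// kernel applies per-channel quantization parameters here.
inline int8_t Requantize(int32_t acc, int32_t bias) {
  const int32_t v = acc + bias;
  return static_cast<int8_t>(std::max<int32_t>(-128, std::min<int32_t>(127, v)));
}

// Depthwise convolution on an NHWC input with N = 1, stride 1, no padding,
// depth_multiplier = 1, and a [1, FH, FW, C] filter.
void DepthwiseConvChunked(const int8_t* input, const int8_t* filter,
                          const int32_t* bias, int8_t* output,
                          int H, int W, int C, int FH, int FW) {
  const int out_h = H - FH + 1;
  const int out_w = W - FW + 1;
  for (int out_y = 0; out_y < out_h; ++out_y) {
    for (int out_x = 0; out_x < out_w; ++out_x) {
      // Process K channels at a time so one small block of accumulators
      // stays hot while the filter window is swept.
      for (int c0 = 0; c0 < C; c0 += kChannelChunk) {
        const int channel_step = std::min(kChannelChunk, C - c0);
        int32_t buffer[kChannelChunk] = {0};
        for (int filter_y = 0; filter_y < FH; ++filter_y) {
          for (int filter_x = 0; filter_x < FW; ++filter_x) {
            // Both pointers advance channel by channel: sequential access.
            const int8_t* in = input +
                ((out_y + filter_y) * W + (out_x + filter_x)) * C + c0;
            const int8_t* flt = filter +
                (filter_y * FW + filter_x) * C + c0;
            for (int ch = 0; ch < channel_step; ++ch) {
              buffer[ch] += static_cast<int32_t>(in[ch]) * flt[ch];
            }
          }
        }
        int8_t* out = output + (out_y * out_w + out_x) * C + c0;
        for (int ch = 0; ch < channel_step; ++ch) {
          out[ch] = Requantize(buffer[ch], bias[c0 + ch]);
        }
      }
    }
  }
}
```

The innermost loops read `in[ch]` and `flt[ch]` at consecutive addresses, which
is exactly the sequential access pattern this design targets, and `buffer` never
grows beyond `K` accumulators regardless of the model's channel count.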