<!-- mdformat off(b/169948621#comment2) -->
# TensorFlow Lite for Microcontrollers Port of 16x8 Quantized Operators

| Status        | Proposed                                                    |
:-------------- |:----------------------------------------------------------- |
| **RFC #**     | [46767](https://github.com/tensorflow/tensorflow/pull/46767)|
| **Author(s)** | Daniel Situnayake (dan@edgeimpulse.com)                     |
| **Sponsor**   | Pete Warden (petewarden@google.com)                         |
| **Updated**   | 2021-01-28                                                  |

## Objective

TensorFlow Lite has kernel implementations that support 8 bit quantized weights
with 16 bit activations. We wish to port these implementations to TensorFlow
Lite for Microcontrollers. The increased precision available for activations can
improve accuracy for some quantized models.

Arm has agreed to support the initiative by adding the necessary 16x8 APIs to
CMSIS-NN and porting the CMSIS-NN kernels.

### Goals
- Port a subset of 16x8 reference kernels from TensorFlow Lite to TensorFlow Lite Micro
- Avoid increasing default code size or arena size of TensorFlow Lite Micro
- Lay the groundwork for creating a CMSIS-NN port of the 16x8 kernels

### Non-goals
- Port every single operator to 16x8; we only plan to port a subset of those with existing reference implementations

## Motivation

Some networks that suffer unacceptable degradation when quantized with 8 bit weights
and 8 bit activations perform adequately when quantized with 8 bit weights and 16
bit activations. The [TensorFlow Lite documentation](https://www.tensorflow.org/lite/performance/post_training_integer_quant_16x8) states the following:

> [16x8 quantization] mode can improve accuracy of the quantized model significantly, when activations are sensitive to the quantization, while still achieving almost 3-4x reduction in model size. Moreover, this fully quantized model can be consumed by integer-only hardware accelerators.

Edge Impulse, a company that deploys TensorFlow Lite for Microcontrollers as part of its embedded
machine learning pipeline, has gathered feedback from customers with production models for which 8 bit
quantization results in unacceptable degradation but 16x8 quantization performs acceptably.

While 16x8 quantization is well supported within TensorFlow Lite, it is not currently supported
within TensorFlow Lite for Microcontrollers. Porting the TensorFlow Lite reference kernels is
relatively straightforward and will improve adoption of TensorFlow Lite for Microcontrollers among users
for whom degradation is too severe with full 8 bit quantization.

## User Benefit

The headline would be "16x8 kernels improve accuracy for quantized models on microcontrollers without
increasing model size".

Users would benefit in the following ways:

- Improved accuracy for quantized models without increasing model size (in exchange for additional
  runtime memory usage)
- Improved performance under certain conditions (for example, 16x8 CMSIS-NN kernels will run faster
  than 8 bit kernels since less unpacking is required)

## Design Proposal

We propose that the 16x8 kernels be ported from the TensorFlow Lite reference kernels to
TensorFlow Lite for Microcontrollers following the process in the [Porting TensorFlow Lite Ops to Micro](https://docs.google.com/document/d/1KLJTPWm4TUKB9YyIqFJl9VCP0ZMJDt_P8RNpRmwqMxw/edit#heading=h.5x0d5h95i329)
guide.

We wish to ensure that the following kernels are compatible with 16x8 mode (a sketch of the type
requirements this implies follows the list):

- Conv2D
- MaxPool2D
- DepthwiseConv2D
- FullyConnected
- Relu
- Relu6
- Tanh
- Softmax
- Pad
- Reshape
- Pack
- Unpack
- Add
- Mul

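For reference, TensorFlow Lite's 16x8 scheme uses int16 activations with a zero point of zero,
int8 weights, and int64 biases for the ops that carry them (such as Conv2D and FullyConnected).
The following is a rough sketch, not the final implementation, of the type validation a ported
kernel's Prepare function might add; the tensor indices and function names are illustrative only:

```
#include "tensorflow/lite/c/common.h"
#include "tensorflow/lite/kernels/kernel_util.h"

// Illustrative tensor indices; each kernel defines its own.
constexpr int kInputTensor = 0;
constexpr int kWeightsTensor = 1;
constexpr int kBiasTensor = 2;

// Sketch of the extra validation a ported kernel's Prepare step might
// perform when it encounters int16 activations.
TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
  const TfLiteTensor* input = tflite::GetInput(context, node, kInputTensor);
  const TfLiteTensor* filter = tflite::GetInput(context, node, kWeightsTensor);
  const TfLiteTensor* bias =
      tflite::GetOptionalInputTensor(context, node, kBiasTensor);

  if (input->type == kTfLiteInt16) {
    // 16x8 activations are symmetrically quantized: zero point must be 0.
    TF_LITE_ENSURE_EQ(context, input->params.zero_point, 0);
    // Weights stay int8, and the bias (when present) widens to int64.
    TF_LITE_ENSURE_EQ(context, filter->type, kTfLiteInt8);
    if (bias != nullptr) {
      TF_LITE_ENSURE_EQ(context, bias->type, kTfLiteInt64);
    }
  }
  return kTfLiteOk;
}
```
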
Adding the 16x8 kernels directly to TFLM alongside the existing kernels would increase the default
code size by an unacceptable amount. Instead, we will make use of the kernel registration API
currently under development by the TFLM team. The use of this API is demonstrated in the
[Keyword benchmark code](https://github.com/tensorflow/tensorflow/blob/a30d20b632b4ffbfd437ccf8ee205fef0917a3eb/tensorflow/lite/micro/benchmarks/keyword_benchmark.cc#L56).
By doing this, the end user can decide which kernels and dependencies they want to include (e.g. 8 bit,
16x8, or float32).

For example, the following could be registered:

```
// Support for all datatypes
op_resolver->AddFullyConnected(tflite::Register_FULLY_CONNECTED());
// Support for 8 bit quantized models only
op_resolver->AddFullyConnected(tflite::Register_FULLY_CONNECTED_INT8());
// Support for 16x8 quantized models only
op_resolver->AddFullyConnected(tflite::Register_FULLY_CONNECTED_INT16X8());
```

This means that kernels not currently using this registration API will need to be refactored to use it.
Currently only **FullyConnected** uses the API.
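
To illustrate the shape of that refactoring, here is a rough sketch of how a kernel might expose a
16x8-only variant. `Register_FULLY_CONNECTED_INT16X8` is this proposal's suggested name, mirroring
the existing `Register_FULLY_CONNECTED_INT8` pattern, and `EvalInt16x8` is a hypothetical invoke
function rather than existing code:

```
// Hypothetical variant registration: the invoke function references only
// the 16x8 code path, so an application that never calls this function
// never links that code.
TfLiteRegistration Register_FULLY_CONNECTED_INT16X8() {
  return {/*init=*/Init,
          /*free=*/nullptr,
          /*prepare=*/Prepare,
          /*invoke=*/EvalInt16x8,
          /*profiling_string=*/nullptr,
          /*builtin_code=*/0,
          /*custom_name=*/nullptr,
          /*version=*/0};
}
```

Because an application links only the registration variants it actually calls, unused evaluation
paths can be discarded at link time, which is what keeps the default code size unchanged.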

The following associated tasks will be required to support this work:

- Build or port unit tests for the new kernels
- Prove that code memory is not impacted by running benchmarks before and after the port

### Alternatives Considered
* An alternative would be to add the 16x8 kernels without using the new kernel registration API, but this
  would result in a major increase in code size.

### Performance Implications
- Impact on memory usage for current modes (int8 and float32) will be minimal. This will be confirmed by
  benchmarking current performance against performance of the submitted changes.
- When 16x8 mode is used, arena RAM usage will roughly double, since int16 activation buffers take twice
  the space of their int8 equivalents. Latency may change depending on the target platform.
- End to end and unit tests will be updated to prove that the new implementations are operating correctly.

### Dependencies
- No additional dependencies will be added to TensorFlow
- No other parts of TensorFlow will be affected

### Engineering Impact
- Impact on binary size should be minimal
- Test times may increase due to additional kernel unit tests
- The reference kernels already exist within TensorFlow Lite so there will be minimal additional maintenance

### Platforms and Environments
- The proposed changes will work on all currently supported platforms

### Best Practices
- The TensorFlow Lite for Microcontrollers documentation should be updated to indicate that 16x8 kernels
  are now available

### Tutorials and Examples
- A benchmark will be added to [`tensorflow/lite/micro/benchmarks`](https://github.com/tensorflow/tensorflow/tree/975335bc83bf3cb80a71a04ed407725508709808/tensorflow/lite/micro/benchmarks) that demonstrates the use of the ops that provide a 16x8 kernel.
- A Colab will be created that demonstrates quantizing a model in 16x8 mode and exporting it as a C header file for use with TensorFlow Lite for Microcontrollers.

### Compatibility
- This work will improve compatibility and feature parity between TensorFlow Lite and TensorFlow Lite for Microcontrollers.

### User Impact
- Since TFLM does not have a versioning system, the feature can be rolled out like any other commit

## Implementation plan

The work will be broken down into a series of pull requests, some for the benchmarks and some for each kernel.

Benchmark pull requests:
- PR1: Create a new benchmark in [`tensorflow/lite/micro/benchmarks`](https://github.com/tensorflow/tensorflow/tree/975335bc83bf3cb80a71a04ed407725508709808/tensorflow/lite/micro/benchmarks) that attempts to run a 16x8 model that includes the kernels mentioned in this RFC. The model’s weights and biases can be random. The benchmark should use the MicroMutableOpResolver. The PR should include the Colab used to generate the model.
- PR2: Port the person_detection and keyword benchmarks to use the MicroMutableOpResolver.
- PR3: Add code to both benchmarks that prints the arena size using the [`RecordingMemoryAllocator`](https://github.com/tensorflow/tensorflow/blob/ee87d58a6504375c28f21ea303f0eefa29118c38/tensorflow/lite/micro/docs/memory_management.md#recording-memory-apis); a sketch follows this list.

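As a rough sketch of what PR3 might add, based on the recording APIs described in the memory
management document linked above (the arena size and helper function here are illustrative, and
`model`, `op_resolver`, and `error_reporter` are assumed to be set up as in the existing benchmarks):

```
#include "tensorflow/lite/micro/recording_micro_interpreter.h"

// Illustrative arena size; the real benchmarks size this per model.
constexpr int kTensorArenaSize = 100 * 1024;
uint8_t tensor_arena[kTensorArenaSize];

void PrintArenaUsage(const tflite::Model* model,
                     const tflite::MicroOpResolver& op_resolver,
                     tflite::ErrorReporter* error_reporter) {
  // RecordingMicroInterpreter wraps MicroInterpreter with a
  // RecordingMicroAllocator that tracks how the arena is used.
  tflite::RecordingMicroInterpreter interpreter(
      model, op_resolver, tensor_arena, kTensorArenaSize, error_reporter);
  interpreter.AllocateTensors();

  // Logs a per-category breakdown of arena usage via the error reporter.
  interpreter.GetMicroAllocator().PrintAllocations();
}
```
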
For each kernel:
- PR1: Refactor the implementation to support the new kernel variant registration API.
- PR2: Add 16x8 support and make sure that the benchmark binary and arena sizes are unchanged.

Note that @njeffrie from the TF Lite Micro team also plans to prepare PR(s) for the kernels that are of
interest internally (without using the kernel variant registration API that manages binary size). This
will provide some quick examples of porting the kernels.

## Questions and Discussion Topics