<!-- mdformat off(b/169948621#comment2) -->
# TensorFlow Lite for Microcontrollers Port of 16x8 Quantized Operators

| Status        | Proposed                                                     |
:-------------- |:----------------------------------------------------------- |
| **RFC #2**    | [46767](https://github.com/tensorflow/tensorflow/pull/46767)|
| **Author(s)** | Daniel Situnayake (dan@edgeimpulse.com)                      |
| **Sponsor**   | Pete Warden (petewarden@google.com)                          |
| **Updated**   | 2021-01-28                                                   |

## Objective

TensorFlow Lite has kernel implementations that support 8 bit quantized weights
but use 16 bit activations. We wish to port these implementations to TensorFlow
Lite for Microcontrollers. The increased precision available for activations can
improve accuracy for some quantized models.

Arm has agreed to support the initiative by adding the necessary 16x8 APIs to
CMSIS-NN and porting the CMSIS-NN kernels.

### Goals
- Port a subset of 16x8 reference kernels from TensorFlow Lite to TensorFlow Lite Micro
- Avoid increasing the default code size or arena size of TensorFlow Lite Micro
- Lay the groundwork for creating a CMSIS-NN port of the 16x8 kernels

### Non-goals
- Port every single operator to 16x8; we only plan to port a subset of those with existing reference implementations

## Motivation

Some networks that suffer unacceptable degradation when quantized with 8 bit weights
and 8 bit activations perform adequately when quantized with 8 bit weights and 16
bit activations. The [TensorFlow Lite documentation](https://www.tensorflow.org/lite/performance/post_training_integer_quant_16x8) states the following:

> [16x8 quantization] mode can improve accuracy of the quantized model significantly, when activations are sensitive to the quantization, while still achieving almost 3-4x reduction in model size. Moreover, this fully quantized model can be consumed by integer-only hardware accelerators.

Edge Impulse, a company that deploys TensorFlow Lite for Microcontrollers as part of its embedded
machine learning pipeline, has gathered feedback from customers with production models for which 8 bit
quantization results in unacceptable degradation but 16x8 quantization performs acceptably.

While 16x8 quantization is well supported within TensorFlow Lite, it is not currently supported
within TensorFlow Lite for Microcontrollers. Porting the TensorFlow Lite reference kernels is
relatively straightforward and will improve adoption of TensorFlow Lite for Microcontrollers among
users for whom degradation is too severe with full 8 bit quantization.

## User Benefit

The headline would be "16x8 kernels improve accuracy for quantized models on microcontrollers without
increasing model size".

Users would benefit in the following ways:

- Improved accuracy for quantized models without increasing model size (in exchange for additional
  runtime memory usage)
- Improved performance under certain conditions (for example, 16x8 CMSIS-NN kernels will run faster
  than 8 bit kernels since less unpacking is required)

## Design Proposal

We propose that the 16x8 kernels are ported from the TensorFlow Lite reference kernels to
TensorFlow Lite for Microcontrollers following the process in the [Porting TensorFlow Lite Ops to Micro](https://docs.google.com/document/d/1KLJTPWm4TUKB9YyIqFJl9VCP0ZMJDt_P8RNpRmwqMxw/edit#heading=h.5x0d5h95i329)
guide.
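To illustrate the shape of such a port, here is a minimal sketch of how a ported kernel's `Eval` function might dispatch on activation type. This is a fragment of a hypothetical kernel source file: the helpers `EvalFloat`, `EvalQuantizedInt8`, and `EvalQuantizedInt16x8` and the constant `kInputTensor` are illustrative stand-ins, not existing functions. The 16x8 path consumes int16 activations, int8 weights, and int64 biases, mirroring the TensorFlow Lite reference kernels.

```
// Sketch only: dispatch on the input type so that a single kernel supports
// float, 8 bit, and 16x8 models. The Eval* helper functions are hypothetical.
TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
  const TfLiteEvalTensor* input =
      tflite::micro::GetEvalInput(context, node, kInputTensor);

  switch (input->type) {
    case kTfLiteFloat32:
      // Existing float path.
      return EvalFloat(context, node);
    case kTfLiteInt8:
      // Existing quantized path: int8 activations and weights, int32 bias.
      return EvalQuantizedInt8(context, node);
    case kTfLiteInt16:
      // New 16x8 path: int16 activations, int8 weights, int64 bias, backed
      // by the ported TensorFlow Lite reference implementation.
      return EvalQuantizedInt16x8(context, node);
    default:
      TF_LITE_KERNEL_LOG(context, "Type %s not supported.",
                         TfLiteTypeGetName(input->type));
      return kTfLiteError;
  }
}
```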
We wish to ensure that the following kernels are compatible with 16x8 mode:

- Conv2D
- MaxPool2D
- DepthwiseConv2D
- FullyConnected
- Relu
- Relu6
- Tanh
- Softmax
- Pad
- Reshape
- Pack
- Unpack
- Add
- Mul

Adding the 16x8 kernels directly to TFLM alongside the existing kernels would increase the default
code size by an unacceptable amount. Instead, we will make use of the kernel registration API
currently under development by the TFLM team. The use of this API is demonstrated in the
[Keyword benchmark code](https://github.com/tensorflow/tensorflow/blob/a30d20b632b4ffbfd437ccf8ee205fef0917a3eb/tensorflow/lite/micro/benchmarks/keyword_benchmark.cc#L56).
By doing this, the end user can decide which kernels and dependencies they want to include (e.g. 8 bit, 16x8,
or float32).

For example, the following could be registered:

```
// Support for all data types
op_resolver->AddFullyConnected(tflite::Register_FULLY_CONNECTED());
// Support for 8 bit quantized models
op_resolver->AddFullyConnected(tflite::Register_FULLY_CONNECTED_INT8());
// Support for 16x8 quantized models
op_resolver->AddFullyConnected(tflite::Register_FULLY_CONNECTED_INT16X8());
```

This means that kernels not currently using this registration API will need to be refactored to use it.
Currently only **FullyConnected** uses the API.

The following associated tasks will be required to support this work:

- Build or port unit tests for the new kernels
- Prove that code memory is not impacted by running benchmarks before and after the port

### Alternatives Considered
* An alternative would be to add the 16x8 kernels without using the new kernel registration API, but this would
  result in a major increase in code size.

### Performance Implications
- Impact on memory usage for current modes (int8 and float32) will be minimal. This will be confirmed by
  benchmarking current performance against the performance of the submitted changes.
- When 16x8 mode is used, RAM usage will approximately double, since activations occupy twice the space.
  Latency may change depending on the target platform.
- End to end and unit tests will be updated to prove that the new implementations are operating correctly.

### Dependencies
- No additional dependencies will be added to TensorFlow
- No other parts of TensorFlow will be affected

### Engineering Impact
- Impact on binary size should be minimal
- Test times may increase due to additional kernel unit tests
- The reference kernels already exist within TensorFlow Lite so there will be minimal additional maintenance

### Platforms and Environments
- The proposed changes will work on all currently supported platforms

### Best Practices
- The TensorFlow Lite for Microcontrollers documentation should be updated to indicate that 16x8 kernels are available

### Tutorials and Examples
- A benchmark will be added to [`tensorflow/lite/micro/benchmarks`](https://github.com/tensorflow/tensorflow/tree/975335bc83bf3cb80a71a04ed407725508709808/tensorflow/lite/micro/benchmarks) that demonstrates the use of the ops that provide a 16x8 kernel.
- A Colab will be created that demonstrates quantizing a model in 16x8 mode and exporting it as a C header file for use with TensorFlow Lite for Microcontrollers

### Compatibility
- This work will improve compatibility and feature parity between TensorFlow Lite and TensorFlow Lite for Microcontrollers

### User Impact
- Since TFLM does not have a versioning system, the feature can be rolled out like any other commit

## Implementation plan

The work will be broken down into a series of pull requests: some for the benchmarks and some for each kernel.

Benchmark pull requests:
- PR1: Create a new benchmark in [`tensorflow/lite/micro/benchmarks`](https://github.com/tensorflow/tensorflow/tree/975335bc83bf3cb80a71a04ed407725508709808/tensorflow/lite/micro/benchmarks) that attempts to run a 16x8 model that includes the kernels mentioned in this RFC. The model's weights and biases can be random. The benchmark should use the MicroMutableOpResolver. The PR should include the Colab used to generate the model.
- PR2: Port the person_detection and keyword benchmarks to use the MicroMutableOpResolver.
- PR3: Add code to both benchmarks that prints the arena size using the [`RecordingMemoryAllocator`](https://github.com/tensorflow/tensorflow/blob/ee87d58a6504375c28f21ea303f0eefa29118c38/tensorflow/lite/micro/docs/memory_management.md#recording-memory-apis), as sketched after this list.
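As a rough illustration of PR3, the following sketch, modeled on the Recording Memory APIs documentation linked above, shows how a benchmark might print its arena usage. The arena size, op set, and `PrintArenaUsage` wrapper are placeholders chosen for this example, and `Register_FULLY_CONNECTED_INT16X8` is the registration function proposed in this RFC rather than an existing API.

```
#include <cstdint>

#include "tensorflow/lite/micro/micro_error_reporter.h"
#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"
#include "tensorflow/lite/micro/recording_micro_interpreter.h"
#include "tensorflow/lite/schema/schema_generated.h"

// Placeholder arena; 16x8 models need roughly twice the activation space of
// their int8 equivalents, so benchmarks should size this generously.
constexpr int kTensorArenaSize = 20 * 1024;
alignas(16) uint8_t tensor_arena[kTensorArenaSize];

void PrintArenaUsage(const unsigned char* model_data) {
  tflite::MicroErrorReporter error_reporter;
  const tflite::Model* model = tflite::GetModel(model_data);

  // Register only the kernel variants the benchmark model needs. The INT16X8
  // variant is the registration proposed in this RFC and does not exist yet.
  tflite::MicroMutableOpResolver<2> op_resolver;
  op_resolver.AddFullyConnected(tflite::Register_FULLY_CONNECTED_INT16X8());
  op_resolver.AddSoftmax();

  // The recording interpreter tracks how the arena is carved up.
  tflite::RecordingMicroInterpreter interpreter(
      model, op_resolver, tensor_arena, kTensorArenaSize, &error_reporter);
  interpreter.AllocateTensors();

  // Logs a breakdown of arena usage (tensor data, scratch buffers, etc.).
  interpreter.GetMicroAllocator().PrintAllocations();
}
```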
For each kernel:
- PR1: Refactor the implementation to support the new kernel variant registration API.
- PR2: Add 16x8 support and make sure that the benchmark binary and arena sizes are unchanged.

Note that @njeffrie from the TF Lite Micro team also plans to prepare PR(s) for the kernels that are of interest internally
(without using the kernel variant registration API that limits binary size). This will provide some quick examples of porting the kernels.

## Questions and Discussion Topics