<!-- mdformateoff(b/169948621#comment2) -->

<!--
Semi-automated TOC generation with instructions from
https://github.com/ekalinin/github-markdown-toc#auto-insert-and-update-toc
-->

<!--ts-->

* [Summary](#summary)
* [High-Level Steps](#high-level-steps)
  * [Why Not Optimize the Portable Reference Kernels?](#why-not-optimize-the-portable-reference-kernels)
* [Software Architecture](#software-architecture)
  * [Hardware-specific NN library](#hardware-specific-nn-library)
  * [Optimized Kernels](#optimized-kernels)
  * [Build System Integration](#build-system-integration)
  * [Testing and Continuous Integration](#testing-and-continuous-integration)

<!-- Added by: advaitjain, at: Wed 17 Feb 2021 02:14:16 PM PST -->

<!--te-->

# Summary

This guide describes the recommended high-level architecture and steps for
adding hardware-specific optimized kernels to TfLite Micro.

The goal of these optimizations, and of the process we recommend for getting
them merged into the TfLite Micro codebase, is a measurable and documented
performance improvement on a benchmark of interest.

Once the optimizations are merged, they will indeed be used for more than the
benchmark, but the context for why they were added remains important.

# High-Level Steps

1. Pick a benchmark on which you would like to measure performance.

    * Existing benchmarks are in the [benchmarks directory](../benchmarks).
    * If none of the existing benchmarks capture your use case, then please
      create a GitHub issue or start a thread on micro@tensorflow.org to
      figure out how to add a new benchmark.
    * If adding a publicly-available benchmark to the TFLM codebase is
      determined to be infeasible, then a fallback is to use an internal
      benchmark and document the benefits of the optimizations in the PR
      descriptions.
    * Adding optimized code without any associated benchmark needs very
      strong justification and will most likely not be permitted.

1. Do the groundwork and architecture needed to add optimizations for your
   target (more details in the
   [software architecture](#software-architecture) section).

1. Create one pull request for each optimized kernel, with the PR description
   clearly stating the commands that were used to measure the performance
   improvement.

    * This context is important even if the toolchain is proprietary and
      there are currently only a small number of users.
    * See [this PR](https://github.com/tensorflow/tensorflow/pull/47098)
      as an example.
    * At a minimum, the latency with and without the particular optimized
      kernel should be documented.
      [Additional context](https://github.com/tensorflow/tensorflow/pull/46746)
      may also be desirable.
    * Here is some
      [general guidance](https://testing.googleblog.com/2017/09/code-health-providing-context-with.html)
      on writing
      [good PR descriptions](https://google.github.io/eng-practices/review/developer/cl-descriptions.html).

## Why Not Optimize the Portable Reference Kernels?

We would like to explicitly point out (as have others) that the reference
kernel implementations are not performant and that there are plenty of
opportunities to speed them up. This is by design: the reference kernels are
meant to be a shared starting point that is then optimized in target-specific
kernel implementations.

Two previous discussions on this topic are in
[PR #42477](https://github.com/tensorflow/tensorflow/pull/42477) and
[PR #45227](https://github.com/tensorflow/tensorflow/pull/45227).

Our current point of view is that while optimizing shared reference code in a
portable manner is attractive, we are making an explicit choice not to go down
that path and to instead rely on target-specific optimized implementations.
The TFLM codebase has a growing list of optimized kernel implementations, and
we are investing in making the process of adding new implementations smoother.

# Software Architecture

The optimized kernel architecture is composed of the following three modules:

1. Hardware-specific NN library
1. Optimized Kernels
1. Build System Integration

## Hardware-specific NN library

This library uses knowledge of the hardware and compiler to implement the
underlying operations. Examples of this are
[CMSIS-NN](https://github.com/ARM-software/CMSIS_5/tree/develop/CMSIS/NN) from
ARM and [NNLib](https://github.com/foss-xtensa/nnlib-hifi4) from Cadence.

The benefits of having this API separation are:

1. The NN library does not need to follow the style guide of the rest of the
   TFLM code.
1. Releases of the NN library can be made independently of TFLM.
1. The same NN library can be used and tested independently of TFLM.
1. The maintainers of the NN library have full control over the development
   process that they would like to follow.
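
To make this separation concrete, here is a minimal sketch of what such a
library's interface might look like. The function `mylib_fully_connected_s8`,
its parameters, and the `mylib_status` type are all hypothetical; see CMSIS-NN
or NNLib for real examples. The point is that the library is plain C over raw
buffers and scalar parameters, with no dependency on TFLM types.

```c++
// Hypothetical NN library header (illustrative only, not a real API).
// Nothing here depends on TFLM: the library operates on raw pointers and
// scalar parameters, which is what makes it usable and testable
// independently of TFLM.

#include <cstdint>

extern "C" {

// Status code returned by every kernel in this hypothetical library.
enum mylib_status { MYLIB_OK = 0, MYLIB_UNSUPPORTED = 1 };

// Quantized (int8) fully-connected layer: for each batch,
// output = requantize(input * weights + bias), where the requantization is
// described by the fixed-point output_multiplier and output_shift.
mylib_status mylib_fully_connected_s8(
    const int8_t* input, const int8_t* weights, const int32_t* bias,
    int8_t* output, int32_t batches, int32_t input_depth,
    int32_t output_depth, int32_t output_multiplier, int32_t output_shift);

}  // extern "C"
```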

## Optimized Kernels

These will be (hopefully thin) wrappers that act as the glue between TFLM and
the NN library.

The goal here is to delegate as much work as possible to the NN library while
still allowing the two APIs (TFLM and NN library) to be independent of each
other. If this separation causes a performance degradation (for example,
unnecessary memory copies), then we can evaluate those costs on a
case-by-case basis.
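
As an illustration, below is a simplified sketch of an optimized kernel's Eval
function that delegates to the hypothetical `mylib_fully_connected_s8` from
the previous section. The `FullyConnectedArgs` struct, the
`GetFullyConnectedArgs` helper, and `EvalFullyConnectedReference` are likewise
hypothetical stand-ins for the tensor and parameter plumbing that a real TFLM
kernel does; see [kernels/cmsis_nn](../kernels/cmsis_nn) for real wrappers.

```c++
#include "tensorflow/lite/c/common.h"

// Sketch of an optimized kernel's Eval function. The hypothetical
// GetFullyConnectedArgs helper stands in for the tensor and quantization
// parameter plumbing of a real kernel, so that only the delegation logic
// is visible.
TfLiteStatus FullyConnectedEval(TfLiteContext* context, TfLiteNode* node) {
  FullyConnectedArgs args;  // Hypothetical struct holding buffers and params.
  if (GetFullyConnectedArgs(context, node, &args) != kTfLiteOk) {
    return kTfLiteError;
  }

  // Delegate to the hardware-specific NN library when it supports the
  // requested configuration.
  if (args.input_type == kTfLiteInt8) {
    if (mylib_fully_connected_s8(args.input, args.weights, args.bias,
                                 args.output, args.batches, args.input_depth,
                                 args.output_depth, args.output_multiplier,
                                 args.output_shift) != MYLIB_OK) {
      return kTfLiteError;
    }
    return kTfLiteOk;
  }

  // Fall back to the portable reference implementation so that the
  // optimized kernel still handles every case the reference kernel does.
  return EvalFullyConnectedReference(context, node);
}
```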
The wrapper code will be reviewed and merged in the TFLM GitHub repository and
must follow the development style of the TFLM codebase.

Some amount of refactoring of the existing code may be needed to ensure that
code is suitably shared between the reference and optimized kernels. There is
currently no fixed recipe for this refactoring, and we will evaluate it on a
case-by-case basis during the PR review.

For example, adding an optimized `fully_connected` implementation for the
Xtensa Fusion F1 took three steps:

* [PR 1](https://github.com/tensorflow/tensorflow/pull/45464): refactor for
  reference fallbacks and a baseline latency.
* [PR 2](https://github.com/tensorflow/tensorflow/pull/46242): refactor to
  share code between reference and optimized kernels.
* [PR 3](https://github.com/tensorflow/tensorflow/pull/46411): add the code
  needed to use the optimized NN lib and document the latency improvement.

## Build System Integration

This module is the least well defined, but we strongly recommend the
following:

1. A single target makefile.inc for all the architectures that you would like
   to support, along with an optional target-specific
   [system_setup.cc](../arduino/system_setup.cc). See
   [cortex_m_generic_makefile.inc](../tools/make/targets/cortex_m_generic_makefile.inc)
   and [xtensa_makefile.inc](../tools/make/targets/xtensa_makefile.inc) as
   examples.

1. A single `ext_libs.inc` (and associated scripts) that downloads any
   external dependencies (including the NN library). For example:

    * [cmsis_nn.inc](../tools/make/ext_libs/cmsis_nn.inc) and
      [cmsis_download.sh](../tools/make/ext_libs/cmsis_download.sh)
    * [xtensa.inc](../tools/make/ext_libs/xtensa.inc) and
      [xtensa_download.sh](../tools/make/ext_libs/xtensa_download.sh)

1. The optimized kernels will then live in a kernels subdirectory (e.g.,
   [kernels/cmsis_nn](../kernels/cmsis_nn) and
   [kernels/xtensa](../kernels/xtensa)).

Two development workflows that the TFLM team would like to encourage and
support are:

1. Exporting a static library and headers into a target-specific development
   environment:

    * Build a static libtensorflow-microlite.a using the TFLM makefile with:
      `make -f tensorflow/lite/micro/tools/make/Makefile TARGET=<target>
      OPTIMIZED_KERNEL_DIR=<optimize_dir> microlite`
    * Use the static library and any TFLM headers as part of the overall
      application (with its own build system).

1. Integrating TFLM with an IDE:

    * This has historically been done using the TFLM Makefile's support for
      project generation.
    * However, given the learning curve and high maintenance overhead, we are
      moving away from supporting project generation via the Makefile and are
      encouraging future IDE integrations to be done outside of the TFLM
      Makefiles.
    * The TFLM team is currently working through the details on this topic.

## Testing and Continuous Integration

The kernel tests are the primary method of ensuring that the optimized kernel
implementations are accurate.

Currently, most of the tests require the optimized kernels to be bit-exact
with the quantized reference implementation. We can revisit this requirement
if it ends up imposing too high a cost on latency.

We strongly encourage optimized kernel implementations to have an associated
continuous build that runs all the unit tests and publishes a build badge to
the
[TFLM community supported builds](../README.md#community-supported-builds)
table. Running the unit tests once a day is often a good place to start.
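
Because the same kernel tests exercise both the reference and the optimized
kernels, checking bit-exactness comes down to asserting on the same expected
quantized outputs. As a rough sketch using TFLM's test macros (the golden
values are placeholders and the kernel invocation is elided; real tests live
next to the kernels in [kernels](../kernels)):

```c++
#include "tensorflow/lite/micro/testing/micro_test.h"

TF_LITE_MICRO_TESTS_BEGIN

TF_LITE_MICRO_TEST(FullyConnectedInt8IsBitExact) {
  // Placeholder data: a real test builds the op's tensors, runs the kernel
  // under test, and compares against golden outputs produced by the
  // quantized reference implementation.
  const int8_t golden[4] = {2, 4, 6, 8};
  int8_t output[4] = {2, 4, 6, 8};  // Would be written by the kernel.

  for (int i = 0; i < 4; ++i) {
    // Bit-exact: every element must match the reference output exactly.
    TF_LITE_MICRO_EXPECT_EQ(golden[i], output[i]);
  }
}

TF_LITE_MICRO_TESTS_END
```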