# EmbARC MLI Library Based Optimizations of TensorFlow Lite Micro Kernels for ARC Platforms

## Maintainers

* [dzakhar](https://github.com/dzakhar)
* [JaccovG](https://github.com/JaccovG)
* [gerbauz](https://github.com/gerbauz)

## Introduction

This folder contains kernel implementations that use the optimized
[embARC MLI Library](https://github.com/foss-for-synopsys-dwc-arc-processors/embarc_mli).
It accelerates inference operations that use int8 (asymmetric) quantization.

## Usage

The embARC MLI Library is used to speed up execution of some kernels for
asymmetrically quantized layers and is enabled with the option
`OPTIMIZED_KERNEL_DIR=arc_mli`. This means that the usual project generation for
an ARC-specific target implies usage of embARC MLI.

For example:

```
make -f tensorflow/lite/micro/tools/make/Makefile TARGET=arc_vpx OPTIMIZED_KERNEL_DIR=arc_mli generate_person_detection_int8_make_project
```

In case the MLI implementation can't be used, kernels in this folder fall back to
the TFLM reference implementations. For applications that may not benefit from the MLI
library, projects can be generated without these implementations by **removing**
`OPTIMIZED_KERNEL_DIR=arc_mli` from the command line, which can reduce overall code size:

```
make -f tensorflow/lite/micro/tools/make/Makefile TARGET=arc_vpx generate_person_detection_int8_make_project
```

For the ARC EM SDP board, a pre-compiled MLI library is downloaded and used in the
application. For a custom ARC-based target platform, MLI sources are downloaded
and compiled during the project generation phase.
To build the library from sources for the ARC EM SDP platform, add the `BUILD_ARC_MLI=true` option to the make command:

```
make -f tensorflow/lite/micro/tools/make/Makefile \
TARGET=arc_emsdp \
OPTIMIZED_KERNEL_DIR=arc_mli \
BUILD_ARC_MLI=true \
generate_person_detection_int8_make_project
```

---

### Optional (experimental features):

TFLM can be built using [embARC MLI Library 2.0](https://github.com/foss-for-synopsys-dwc-arc-processors/embarc_mli/tree/Release_2.0_EA) as an experimental feature.
To build TFLM using the embARC MLI Library 2.0, add the following tag to the command:
```
ARC_TAGS=mli20_experimental
```
In this case, generated projects will be placed in the `<tcf_file_basename>_mli20_arc_default` folder.

Some configurations may require a custom run-time library specified using the `BUILD_LIB_DIR` option. Please check the MLI Library 2.0 [documentation](https://github.com/foss-for-synopsys-dwc-arc-processors/embarc_mli/tree/Release_2.0_EA#build-configuration-options) for more details. The following option can be added:
```
BUILD_LIB_DIR=<path_to_buildlib>
```

---

If an application exclusively uses accelerated MLI kernel implementations, the
TFLM reference kernel implementations can be stripped out to reduce the code size
of the application. Build the application with the `MLI_ONLY=true` option in the
generated project (after the project was generated):

```
cd tensorflow/lite/micro/tools/make/gen/arc_vpx_arc_default/prj/person_detection_int8/make

make app MLI_ONLY=true
```

If you try this and application execution fails, then most probably MLI can't be
used for some nodes and you need to revert to using the TFLM reference kernels.

## Limitations

Currently, the MLI Library provides optimized implementations only for int8
(asymmetric) versions of the following kernels:

1. Convolution 2D – per-axis quantization only, `dilation_ratio==1`
2. Depthwise Convolution 2D – per-axis quantization only, `dilation_ratio==1`
3. Average Pooling
4. Max Pooling
5. Fully Connected

## Scratch Buffers and Slicing

The following information applies only to ARC EM SDP, VPX, and other targets with XY or VCCM
memory. embARC MLI uses specific optimizations which assume node operands are
in XY memory, VCCM memory, and/or DCCM (Data Closely Coupled Memory). As operands might be
quite big and may not fit in the available XY or VCCM memory, special slicing logic is
applied which allows kernel calculations to be split into multiple parts. For
this reason, internal static buffers are allocated in these X, Y, VCCM, and DCCM memory
banks and used to execute sub-calculations.

All this is performed automatically and is invisible to the user. Half of the DCCM
memory bank and the full XY banks, or 3/4 of the VCCM bank, are occupied for MLI-specific needs.
If the user needs space in XY or VCCM memory for other tasks, these arrays can be reduced by
setting specific sizes. For this, add the following option to the build command,
replacing **<size[a|b|c]>** with the required values:

**For EM:**
```
EXT_CFLAGS="-DSCRATCH_MEM_Z_SIZE=<size_a> -DSCRATCH_MEM_X_SIZE=<size_b> -DSCRATCH_MEM_Y_SIZE=<size_c>"
```
**For VPX:**
```
EXT_CFLAGS="-DSCRATCH_MEM_VEC_SIZE=<size_a>"
```

For example, to reduce the sizes of the arrays placed in DCCM and XCCM to 32k and 8k
respectively, use the following command:

```
make app EXT_CFLAGS="-DSCRATCH_MEM_Z_SIZE=32*1024 -DSCRATCH_MEM_X_SIZE=8*1024"
```

## License

TensorFlow's code is covered by the Apache 2.0 License included in the repository,
and third-party dependencies are covered by their respective licenses, in the
third_party folder of this package.