Abstract
With so many varieties of AI hardware on the market, it is often hard to decide which one is the most suitable for a given deployment, rather than simply the one with the best peak performance. Given the industry-wide demand for deep learning deployment, an inference benchmark that measures the effectiveness of DNN processors becomes important and is of great help in selecting and optimizing AI hardware. To systematically benchmark deep learning deployment platforms and provide a more objective and useful comparison of metrics, this paper proposes an end-to-end benchmark evaluation system called IBD, which combines four steps and three components with six metrics. Performance comparison results are obtained on chipsets from Qualcomm, HiSilicon, and NVIDIA, all of which provide hardware acceleration for AI inference. To comprehensively reflect the current deployment performance of DNN processors, we chose six devices from three deployment scenarios (cloud, desktop, and mobile) and ten models from three kinds of applications with diverse characteristics, all trained with three major training frameworks. Several important observations were made using our methodology. Experimental results show that workload diversity should account for the differences arising from training frameworks, inference frameworks on specific processors, input size, and precision (floating-point and quantized).
Introduction
As deep learning has revolutionized many application domains, there has been a growing demand for newer and better hardware and software platforms to support the deployment of ever more sophisticated deep neural network (DNN) models. DNN architectures involve heavy computation and require hardware that provides massive computing power. In terms of computation, inference is a subset of the training process. Even so, the two differ greatly in data dependency, and therefore in their memory access needs. As a result, AI chips for training and inference are gradually moving towards differentiated solutions, and the development route is still iterating rapidly.
For the inference task, DNN architectures such as convolutional neural networks (CNNs) involve specific computation and require hardware, such as CPUs, GPUs, and AI accelerators, together with a dedicated inference software framework. A large variety of customized hardware architectures have emerged for inference only, ranging from specialized GPUs (e.g., NVIDIA Tesla [1]) and FPGAs (e.g., DPU [2] and [3]) to ASICs (e.g., Cambricon-x [4] and TPU [5]) and several SoCs with AI accelerators (e.g., Exynos NPU [6] and Da Vinci NPU [7]). In addition, DNN hardware acceleration on a specific platform needs an optimized framework. Recently, several new tools have been released, such as TensorFlow Lite [8] for the edge/mobile side, TensorRT [9] for the cloud side, and tools from other vendors.
A useful methodology is needed to compare these platforms and help users choose the most suitable one. Few AI benchmarks are available in either academia or industry. Examples are BenchNN [10], DeepBench [11], and Dawn Bench [12], in which system-level influencing factors are seldom considered. AI Matrix [13] proposes a synthetic benchmark framework and discusses how to design DNN models, but only for the cloud side. BenchIP [14] mainly focuses on operator design. Researchers have built comprehensive benchmarks and profiling tools for DNN training [15, 16] but rarely focus on inference. AI-benchmark [7] gives a detailed analysis of the performance of mobile AI accelerators but does not cover cloud-side platforms. MLPerf [17] is a machine learning benchmark suite that has gained industry-wide support and recognition and covers both training and inference, but it mainly focuses on latency and accuracy metrics.
Considering that inference differs significantly from training in deployment scenarios and workload diversity, our primary goal in this work is to propose a new inference benchmark for DNN processors, called IBD. The IBD system aims to identify bottlenecks in system optimization (in both software and hardware) and to provide a more objective and useful comparison of metrics.
For IBD, relevant metrics include accuracy, latency, throughput, power, energy efficiency, as well as accelerator utilization.
The contributions of the paper can be summarized as follows. First, we propose the components and workflow of an inference benchmark for DNN processors, called IBD, which provides a more objective and useful comparison of metrics by considering real market demand. Second, we obtain experimental results on six devices from three deployment scenarios (cloud, desktop, and mobile) using ten models from three different applications with diverse characteristics, trained with three major training frameworks, so that the results comprehensively reflect the current deployment performance of DNN processors. Third, we analyze in depth the factors that can affect the benchmark results.
The rest of the paper is structured as follows: in Section 2, the IBD system and workflow are introduced; in Section 3, the experimental setup and results are introduced; in Section 4, the experimental result is discussed and an outlook on the future work is given.
Benchmark evaluation system for DNN deployment platforms
In this part, the components and workflow of the IBD system are introduced. In deep learning, after the training task, models need to be deployed on different sides such as cloud, desktop, or mobile. The algorithmic differences between training and inference lead to many differences in requirements for the underlying systems and hardware architecture.
IBD aims to provide a highly generic benchmark system for the inference task, with neither restricted hardware nor platform-specific components. The system enables users to evaluate real hardware performance while taking the influence of software and applications into account, and its metrics directly indicate which platform is the most suitable choice. Fig. 1 illustrates the components of the IBD system and the general workflow of the proposed framework, which consists of three main parts: input, IBD platform, and output.

The components and workflow of the IBD system.
The pipeline starts with a model trained in an external deep learning training framework, such as TensorFlow, Caffe2, or PyTorch. The trained model is then converted into a format compatible with the inference hardware through the corresponding conversion tools. Finally, the benchmark tools are run and the output part reports the metric results.
IBD platform
Apart from third-party inference frameworks, each vendor mainly relies on vector and matrix-based instructions to provide hardware acceleration for DNN inference. These instructions, and access to them, depend on each vendor's proprietary Software Development Kit (SDK), such as HiSilicon chipsets with the HiAI SDK, Qualcomm chipsets with the SNPE SDK, MediaTek chipsets with the NeuroPilot SDK, Unisoc chipsets with the UNIAI SDK, Samsung chipsets with the EDEN SDK, NVIDIA GPUs with TensorRT, and Arm Cortex CPUs and Mali GPUs with the NN SDK [7]. These SDKs are mutually incompatible, which makes porting acceleration solutions difficult. All of these inference frameworks and SDKs, together with their corresponding model conversion tools, need to be highly customized to the DNN processors.
Output
Benchmark metrics should capture both the performance and the quality of inference. Below we describe the metrics we collect as part of the profiling process.
MACs denotes the number of multiply-add (multiply-accumulate) operations in a convolutional neural network.
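As an illustration of how a MAC count is typically derived, the following minimal sketch counts the multiply-accumulate operations of a single 2D convolution layer. The function name, its parameters, and the example layer shapes are assumptions made for this illustration and are not part of the IBD tooling; bias additions and activation functions are ignored.

```python
def conv2d_macs(in_channels, out_channels, kernel_h, kernel_w,
                out_h, out_w, groups=1):
    """Count multiply-accumulate operations of one 2D convolution layer.

    Each output element needs (in_channels / groups) * kernel_h * kernel_w
    multiply-accumulates, and there are out_channels * out_h * out_w
    output elements. Bias and activation costs are ignored.
    """
    if in_channels % groups != 0 or out_channels % groups != 0:
        raise ValueError("channel counts must be divisible by groups")
    macs_per_output = (in_channels // groups) * kernel_h * kernel_w
    return macs_per_output * out_channels * out_h * out_w


# Hypothetical layer shapes, chosen only for illustration.
standard_conv = conv2d_macs(3, 64, 7, 7, 112, 112)
depthwise_conv = conv2d_macs(32, 32, 3, 3, 112, 112, groups=32)
print(standard_conv, depthwise_conv)
```

Summing this count over all convolution layers (and, analogously, fully connected layers) gives the per-inference MACs of a model for a given input size.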
To measure accelerator utilization, we compared the number of instructions of a specific precision (FP32, FP16, or INT8) that the processor actually executed while it was active with the maximum number of instructions of that precision it can theoretically execute during the same time. The ratio of the two, expressed as a percentage, gives the fraction of the processor's floating-point or integer capacity (FLOPS, floating-point operations per second, or OPS, operations per second) that is utilized.
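The ratio described above can be sketched as follows. This is a minimal illustration that assumes the executed operation count, the theoretical peak rate for the chosen precision, and the active time are already available from a profiler or a datasheet; the function name and the example numbers are hypothetical.

```python
def utilization_percent(executed_ops, peak_ops_per_second, active_seconds):
    """Return the share of theoretical capacity actually used, in percent.

    executed_ops: operations of one precision (e.g. INT8) actually executed
    peak_ops_per_second: theoretical peak rate for that same precision
    active_seconds: time during which the processor was active
    """
    if peak_ops_per_second <= 0 or active_seconds <= 0:
        raise ValueError("peak rate and active time must be positive")
    theoretical_max_ops = peak_ops_per_second * active_seconds
    return 100.0 * executed_ops / theoretical_max_ops


# Hypothetical numbers: 5e12 operations executed in 1 s on a processor
# with an assumed peak of 10e12 operations per second.
print(utilization_percent(5e12, 10e12, 1.0))
```

The executed count and the peak rate must refer to the same precision, since a processor's peak rate usually differs between FP32, FP16, and INT8.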
Overview of benchmarks accuracy metrics
Experimental setup
Our experiments consisted of two parts. The first part investigated how specific variables, such as the training framework and the inference framework, affect the results. Following the IBD system, four comparison experiments were set up, covering the training framework, the model conversion tools, the precision (floating-point versus quantized models), and the input size. The second part, taking all of the variables mentioned above into consideration, was a benchmark evaluation with the IBD system on the most commonly used processors for inference.
To obtain representative results, three kinds of devices were included: mobile phones, desktops, and servers. Both Linux and Android were used as operating systems. The three training frameworks used in this work were TensorFlow, Caffe2, and PyTorch, and typical inference frameworks such as TensorRT, SNPE, HiAI, TFLITE with NNAPI, and MACE were used. The experiments were carried out on CPU, GPU, DSP, and NPU (Neural-network Processing Unit) processors.
The evaluation process was conducted as follows. Initialization, pre-processing, and post-processing times were not included in the overall processing time. One warm-up inference run was executed before the actual measurements. Five consecutive inference iterations were then executed and averaged to reduce variance. The ambient temperature of the device under test was kept at a normal level. When models were run on mobile devices, the device was kept above 80% battery charge, with a pause of 5 minutes between executions.
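The timing part of this procedure can be sketched as a small harness. This is a minimal illustration, not the IBD benchmark tool itself: `run_inference` stands for any hypothetical zero-argument callable that performs exactly one inference on already pre-processed input, so that initialization, pre-processing, and post-processing stay outside the timed region.

```python
import time
from statistics import mean


def measure_latency(run_inference, warmup_runs=1, timed_runs=5):
    """Return the mean latency in seconds over the timed runs.

    Warm-up runs are executed first and discarded, so one-off costs
    such as caching and lazy initialization do not affect the result.
    """
    for _ in range(warmup_runs):
        run_inference()

    durations = []
    for _ in range(timed_runs):
        start = time.perf_counter()
        run_inference()
        durations.append(time.perf_counter() - start)
    return mean(durations)


def fake_inference():
    """Stand-in workload used only to make the sketch runnable."""
    sum(i * i for i in range(10000))


average_latency = measure_latency(fake_inference)
print(f"mean latency: {average_latency * 1000:.3f} ms")
```

On accelerators that execute asynchronously, the callable would also have to wait for the device to finish before returning; otherwise only the launch time is measured.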
Hardware specifications
Models used for evaluation
Training framework
One mobile-side Snapdragon 845 and one cloud-side NVIDIA Tesla T4 were used for this experiment. Three of the most popular models were chosen, namely Inception_v3, ResNet50, and MobileNet_v1, each trained with both Caffe2 and TensorFlow. With the same input size, batch size (bs = 1), and processor with its specific inference framework, Inception_v3 was run on the Snapdragon 845, while ResNet50 and MobileNet_v1 were run on the Tesla T4. As shown in Fig. 2, with the same model architecture and hardware platform, the performance difference between the Caffe2 and TensorFlow models reached 6.917 samples per second on the Snapdragon 845, and factors of 1.9 and 2.3 for the two models on the T4.

Compared result of training framework difference.
The ImageNet validation data set was used to calculate the accuracy. According to the top-1 and top-5 accuracy loss results shown in Fig. 2(b), all losses were less than 4%.
Observation 1: The training framework of the input model does influence the performance of the inference platform, even when the model architecture is the same. We found that most benchmark suites consider only applications and models as the main workload factors and neglect differences between training frameworks. Our work shows that the characteristics of the training framework can also influence the benchmark result.
Figure 3 shows how the input size influenced performance on the desktop GeForce GTX 1080Ti processor. The input size was varied from 128x128 to 1024x1024. Three representative networks based on PyTorch were used for the experiment: Inception_v3 [28], MobileNet_v2 [29], and ResNet101 [30]. Performance decreased as the input size increased for all three models.
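For reference, top-1 and top-5 accuracy and the resulting accuracy loss can be computed along the following lines. This is a minimal sketch in plain Python; the function names and the toy scores are assumptions for illustration, and accuracy loss is taken here to mean the accuracy of the original trained model minus that of the deployed model.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


def accuracy_loss(reference_accuracy, deployed_accuracy):
    """Accuracy lost by deployment, in the same unit as the inputs."""
    return reference_accuracy - deployed_accuracy


# Toy example with three samples and three classes (illustrative only).
scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
labels = [1, 1, 0]
print(top_k_accuracy(scores, labels, k=1))
print(top_k_accuracy(scores, labels, k=2))
```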

Input size influence result.
Observation 2: Performance decreased as the input size increased for all models, although the rate of decline differed with the model architecture and with the memory and bandwidth of the processor. As expected, the larger the input size, the lower the throughput for all models. We conclude that, to choose the most suitable input size, one should consider both the model architecture and the deployment type.
To explore the performance of the model conversion tools and the corresponding inference frameworks published by the respective companies and vendors, we chose the Kirin 980 and Snapdragon 855 as the test platforms for this experiment. Inception_v3 [22] with input size 299x299, MobileNet_v2 [23] with input size 224x224, and ResNet101 [21] with input size 224x224 were used. The results of the four comparison experiments, which cover five different inference framework configurations (HiAI, SNPE, MACE, and TFLITE with and without NNAPI), are shown in Fig. 4. 1k images from the ImageNet validation data set were used to calculate the accuracy. With the reasonable accuracy loss shown in Fig. 4(b), the same model gave very different performance results under different inference frameworks and runtimes, which shows that both the inference framework and the runtime can have a large influence on inference performance.

Compared result of inference framework difference.
Observation 3: Inference framework diversity is an important factor when comparing the performance of different DNN processors. We found that the measured performance of a model on a specific processor can vary greatly between inference frameworks, and hence the choice of inference framework matters in any comparison of DNN processors. For example, for these 4 models, the performance difference between inference frameworks could generally reach almost 7 to 40 times.
Figure 5 shows the performance of floating-point and quantized models on the Kirin 980. The two precision types showed nearly the same accuracy for all 7 models, as shown in Fig. 5(a), and the INT8 models were approximately twice as fast as the FP16 models, as shown in Fig. 5(b).

Floating-point and quantized models performance.
Observation 4: For specific models, INT8 inference is becoming faster, and the accuracy gap between INT8 and FP16 inference is narrowing. However, because floating-point and quantized models have different properties and show different accuracy results, numbers obtained by mixing the two are not meaningful. One should avoid comparing the performance of two devices by running floating-point models on one and quantized models on the other.
Following the IBD system, our benchmark analysis focused on a set of key metrics: latency, throughput, power, and actual energy efficiency, as well as accelerator utilization. Figure 6 shows the benchmark results with the NVIDIA Tesla P4 and T4 as the experiment platforms. According to the accuracy loss shown in Fig. 7, all losses were within 4%.

Benchmark result of T4&P4.

Accuracy loss T4&P4.
Observation 5: With an increase in the batch size for all models, latency decreased while the execution time and throughput increased. The T4 delivered up to 2 to 4 times better throughput than the P4. Because inference is a latency-sensitive and energy-sensitive task, throughput is not the only primary performance metric of concern. There is a trade-off between latency and throughput, and the applicability of each approach depends on the particular task and the corresponding hardware and energy consumption limitations.
Observation 6: ResNet50, ResNet152, and VGG16 showed equal energy efficiency, and utilization increased with model complexity. The MACs of these three models were 4.12 GOPs, 11.58 GOPs, and 15.5 GOPs, respectively. Model diversity led to different throughput but did not have a large influence on energy efficiency. The lightweight model MobileNet_v1, however, did show a different result.
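The relationship between these metrics can be made concrete with a short sketch. It assumes that throughput is computed as the batch size divided by the time to process one batch, and that energy efficiency is throughput per watt of average power; these definitions, the function names, and the numbers below are illustrative assumptions rather than measured IBD results.

```python
def throughput_sps(batch_size, batch_time_s):
    """Samples processed per second for one batch configuration."""
    return batch_size / batch_time_s


def energy_efficiency(throughput, average_power_w):
    """Samples per second per watt, i.e. samples per joule."""
    return throughput / average_power_w


# Hypothetical measurements: (batch size, seconds per batch, watts).
measurements = [(1, 0.002, 40.0), (8, 0.008, 50.0), (32, 0.024, 60.0)]

for batch_size, batch_time_s, power_w in measurements:
    tput = throughput_sps(batch_size, batch_time_s)
    eff = energy_efficiency(tput, power_w)
    print(f"bs={batch_size:2d}  time/batch={batch_time_s * 1000:5.1f} ms  "
          f"throughput={tput:7.1f} samples/s  efficiency={eff:5.2f} samples/J")
```

In this made-up example a larger batch raises throughput, but each request also waits longer for its batch to complete, which is the kind of trade-off a deployment has to weigh against its latency and energy budget.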
Four models with diverse characteristics, listed in Table 4, are evaluated. Each one presents a unique design trade-off between accuracy, parameters, and computational complexity (MACs) with the same input size and data set.
Comparison of the four evaluated models
Observation 7: With an increase in the batch size for all models, utilization decreased. The MobileNet_v1 model had low GPU utilization. Even with the maximum batch size, the GPU INT8 utilization of MobileNet_v1 was much lower than that of the other three models.
This is the first available study that focuses on an inference benchmark for DNNs across three deployment types: cloud, desktop, and mobile. We propose the IBD system, which aims to identify hardware bottlenecks and provide a more objective and useful comparison of metrics. The analysis can be summarized as follows:

Analysis of the models' real execution time.
In this paper, we propose a new benchmark suite for DNN inference, called IBD, which covers a wide range of inference deployment platforms, from mobile and desktop to the cloud side. IBD consists of 10 state-of-the-art DNN models implemented in major training frameworks such as TensorFlow, Caffe2, and PyTorch. We use these models to perform extensive performance analysis and profiling to shed light on the efficiency of DNN inference for different hardware configurations. We propose the components and workflow of the IBD system, which provides a more objective and useful comparison of metrics that reflects real market demand. Additionally, the results of four related comparison experiments comprehensively reflect the current deployment performance of DNN processors and provide an in-depth analysis of the factors that affect the benchmark results. The benchmark results also characterize the real performance of DNN processors through 6 metrics from different perspectives. Using our methodology and metrics, we draw several important observations and recommendations on where future research and optimization of DNN model deployment should be focused. We hope that our IBD benchmark suite, tools, methodologies, and observations will be useful for a large number of ML developers and system designers in making their DNN inference process efficient.
