IBD: The metrics and evaluation method for DNN processor benchmark while doing inference task

Abstract

With many varieties of AI hardware on the market, it is often hard to decide which one is the most suitable for a given use rather than simply the one with the best performance. As industry-wide demand for deep learning deployment grows, an inference benchmark that measures the effectiveness of DNN processors becomes important and is of great help in selecting and optimizing AI hardware. To benchmark deep learning deployment platforms systematically and to give a more objective and useful comparison of metrics, this paper proposes an end-to-end benchmark evaluation system called IBD, which combines 4 steps, three components, and 6 metrics. Performance comparison results are obtained on chipsets from Qualcomm, HiSilicon, and NVIDIA that provide hardware acceleration for AI inference. To reflect the current state of DNN processor deployment performance comprehensively, we chose six devices from three deployment scenarios (cloud, desktop, and mobile) and ten models from three kinds of applications with diverse characteristics, all trained with three major training frameworks. Several important observations were made using our methodology. The experimental results show that workload diversity should account for differences arising from the training framework, the inference framework on a specific processor, the input size, and the precision (floating-point versus quantized).

Keywords

AI, deep neural network processor, benchmark, end to end, inference

1 Introduction

As deep learning has revolutionized many application domains, there has been a growing demand for newer and better hardware and software platforms to support the deployment of ever more sophisticated deep neural network (DNN) models. DNN architectures involve heavy computation and require hardware that provides massive computing power. In terms of computation, inference is a subset of the training process. Even so, the two differ greatly in data dependency and therefore in their memory access needs. As a result, AI chips for training and for inference are gradually moving towards differentiated solutions, and the development route is still iterating rapidly.

During inference, DNN architectures such as convolutional neural networks (CNNs) involve specific computation and require hardware such as CPUs, GPUs, and AI accelerators together with a dedicated inference software framework. A large variety of customized hardware architectures have emerged for inference only, ranging from specialized GPUs (e.g., NVIDIA Tesla [1]), FPGAs (e.g., DPU [2] and [3]), and ASICs (e.g., Cambricon-X [4] and TPU [5]) to SoCs with AI accelerators such as the Exynos NPU [6] and the Da Vinci NPU [7]. In addition, DNN hardware acceleration on a specific platform needs an optimized framework. Recently, several new tools have been released, such as TensorFlow Lite [8] for the edge/mobile side, TensorRT [9] for the cloud side, and other vendors’ tools.

A useful methodology is needed to compare them and help users choose the most suitable one. Few AI benchmarks are available in either academia or industry. Examples are BenchNN [10], DeepBench [11], and DAWNBench [12], which seldom consider the factors that influence the system as a whole. AI Matrix [13] proposes a synthetic benchmark framework and discusses how to design DNN models, but only for the cloud side. BenchIP [14] mainly focuses on operational design. Researchers currently build comprehensive benchmarks and profiling tools for DNN training [15, 16] but rarely focus on inference. AI Benchmark [7] gives a detailed analysis of the performance of mobile AI accelerators but does not cover cloud-side platforms. MLPerf [17] is a machine learning benchmark suite that has gained industry-wide support and recognition and covers both training and inference, but it focuses mainly on latency and accuracy metrics.

Considering that inference differs significantly from training in deployment scenarios and workload diversity, our primary goal in this work is to propose a new inference benchmark for DNN processors, called IBD. The IBD system aims to identify bottlenecks in system optimization (in both software and hardware) and to give a more objective and useful comparison of metrics.

For IBD, relevant metrics include accuracy, latency, throughput, power, energy efficiency, as well as accelerator utilization.

The contributions of the paper can be summarized as follows:

We propose the components and workflow of an inference benchmark for DNN processors, called IBD, which gives a more objective and useful comparison of metrics by considering real market demand.

We obtain experimental results on six devices from three deployment scenarios (cloud, desktop, and mobile) and on ten models from three different applications with diverse characteristics, trained with three major training frameworks, which together comprehensively reflect the current state of DNN processor deployment performance. We also analyze in depth the factors that can affect the benchmark results.

The rest of the paper is structured as follows: in Section 2, the IBD system and workflow are introduced; in Section 3, the experimental setup and results are introduced; in Section 4, the experimental results are discussed and an outlook on future work is given.

2 Benchmark evaluation system for DNN deployment platforms

This section introduces the components and workflow of the IBD system. In deep learning, after training, models need to be deployed on different sides such as cloud, desktop, or mobile. The algorithmic differences between training and inference lead to many differences in the requirements for the underlying systems and hardware architecture.

IBD aims to provide a highly generic benchmark system for the inference task, with neither restricted hardware nor platform-specific components. The system enables users to evaluate real hardware performance while taking the influence of software and application into consideration, and its metrics directly reflect which processor is the most suitable choice. Fig. 1 illustrates the components of the IBD system and its general workflow. It consists of three main parts: input, IBD platform, and output.

Fig. 1

The components and workflow of the IBD system.

The pipeline starts with a model trained in an external deep learning training framework, such as TensorFlow, Caffe2, or PyTorch. The trained model is then converted into a format compatible with the inference hardware through specific convert tools. Finally, the benchmark tools run the converted model and the output part reports the metric results.

2.1 Input

Trained model. To reflect inference performance, the deep learning (DL) models run in IBD tests should represent the most popular and commonly used network architectures that can currently be deployed on specific devices. The IBD system should also account for the risk of becoming obsolete, since DL models evolve rapidly. The DL models should reveal deep insights into the interactions between model attributes and hardware performance, so the accuracy, number of parameters, and computational complexity (MACs) of the models should be taken into consideration. Besides the model architecture, the training framework should also be treated as a variable that can affect the benchmark results.

Test data set. The input data usually depends on the trained model, which expects a specific input size. For real workloads or real models, the input size usually corresponds to real demands, which can differ completely from case to case. When evaluating the inference result, the number of test images should also be taken into consideration, since it may influence the accuracy result.

2.2 IBD platform

Model convert tools. DNN processors complement general-purpose CPUs and perform computationally intensive work more efficiently, e.g. by favouring specific operations and data-parallel computation. For the inference task, there are usually not enough resources to support raw large-scale DL models. Model convert tools can optimize DL models and quantize their weights to low precision, which reduces resource costs.

Additionally, when deploying DL models, users can choose between floating-point and quantized models. There is always a trade-off between the two model types: floating-point models will always show better accuracy (since they can simply be initialized with the weights of the quantized model and further trained for higher accuracy), while quantized models yield faster inference. The most important challenge is to ensure that there is no significant loss in model accuracy after quantization. Model convert tools should take this difference into account, and one should definitely avoid comparing the performance of two different hardware platforms by running floating-point models on one and quantized models on the other.
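
The accuracy cost of quantization comes from rounding each weight onto a coarse integer grid. The following is a minimal Python sketch of symmetric per-tensor INT8 quantization, assuming one shared scale factor for the whole weight list. It is a generic illustration and not the scheme of any particular vendor's convert tool; the function names and sample weights are hypothetical.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map float weights to integers in [-127, 127] using one shared scale."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0  # avoid division by zero
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the INT8 values."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only to illustrate the rounding error.
weights = [0.42, -1.27, 0.003, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

In this scheme the per-weight error is bounded by half a quantization step (`scale / 2`), which is why small weights next to one large outlier lose the most relative precision.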

Benchmark tools. Given the input test data and the converted models, the benchmark tools can be divided into three components: [Data pre-processing] -> [Inference execution] -> [Post-processing]. Data pre-processing usually resizes and normalizes the input data. The inference framework then carries out the inference task with the converted model and the pre-processed data. Post-processing handles the data after inference in order to calculate the corresponding metrics.

Inference framework. Frameworks for executing DL models on mobile devices, embedded systems, or inference engines pursue optimized deployment on devices with limited resources. An inference framework manages memory allocation efficiently and exploits the available hardware resources as well as possible. Several frameworks for deep learning acceleration on mobile devices and embedded systems appeared when no efficient off-the-shelf solutions were available, such as TensorFlow Lite [9], MACE [18], and Core ML [19]. Google has published the Android Neural Networks API [20] (NNAPI) to support Android frameworks and devices.

Apart from third-party inference frameworks, each vendor mainly relies on vector and matrix-based instructions to provide hardware acceleration for DNN inference. These instructions and the access to them depend on each vendor's proprietary Software Development Kit (SDK), such as HiSilicon chipsets with their HiAI SDK, Qualcomm chipsets with their SNPE SDK, MediaTek chipsets with their NeuroPilot SDK, Unisoc chipsets with their UNIAI SDK, Samsung chipsets with their EDEN SDK, NVIDIA GPUs with TensorRT, Arm Cortex CPUs and Mali GPUs with NN SDK [7], etc. These SDKs are incompatible with each other, which makes porting acceleration solutions difficult. All these inference frameworks and SDKs, together with their corresponding model convert tools, need to be highly customized to the DNN processors.
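
The three-stage structure of the benchmark tools can be sketched as below. This is a minimal illustration that times each stage separately, so that inference time can be reported without pre- and post-processing. The stage functions (`normalize`, `dummy_model`, `argmax`) are hypothetical stand-ins; a real run would call the vendor's inference framework inside the inference stage.

```python
import time
from typing import Callable, Dict, List, Sequence, Tuple


def run_benchmark(
    samples: Sequence[List[float]],
    preprocess: Callable[[List[float]], List[float]],
    infer: Callable[[List[float]], List[float]],
    postprocess: Callable[[List[float]], int],
) -> Tuple[List[int], Dict[str, float]]:
    """Run pre-process -> inference -> post-process, timing each stage."""
    seconds = {"preprocess": 0.0, "inference": 0.0, "postprocess": 0.0}
    results = []
    for sample in samples:
        t0 = time.perf_counter()
        model_input = preprocess(sample)
        t1 = time.perf_counter()
        raw_output = infer(model_input)
        t2 = time.perf_counter()
        results.append(postprocess(raw_output))
        t3 = time.perf_counter()
        seconds["preprocess"] += t1 - t0
        seconds["inference"] += t2 - t1
        seconds["postprocess"] += t3 - t2
    return results, seconds


def normalize(pixels: List[float]) -> List[float]:
    """Hypothetical pre-processing: scale 8-bit pixel values to [0, 1]."""
    return [p / 255.0 for p in pixels]


def dummy_model(x: List[float]) -> List[float]:
    """Hypothetical two-class 'model' used only to exercise the pipeline."""
    total = sum(x)
    return [total, 1.0 - total]


def argmax(scores: List[float]) -> int:
    """Hypothetical post-processing: index of the highest score."""
    return max(range(len(scores)), key=scores.__getitem__)


results, seconds = run_benchmark(
    [[0, 0, 0], [255, 255, 255]], normalize, dummy_model, argmax
)
print(results, seconds)
```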

2.3 Output

Benchmark metrics should capture both the performance and the quality of inference. Below we describe the metrics we collect as part of the profiling process.

Latency. Inference is a latency-sensitive task. Here the latency covers only the inference procedure, without pre-processing and post-processing. The time $\Delta T_i$ is the time needed to complete the inference task for one sample $i$ under a given application scenario with batch size = 1. The number of test set samples is denoted by $N$, and the inference latency $T$ is calculated as: $T = \frac{1}{N} \sum_{i=1}^{N} \Delta T_i$ (1)

Throughput. A key metric when evaluating inference efficiency is the number of input data samples processed per second (samples per second). We refer to this metric as throughput. With the batch size denoted by $B$ and the average latency of a batch denoted by $\Delta T_{\text{Batch}}$, throughput is calculated as: $\text{Throughput} = \frac{B}{\Delta T_{\text{Batch}}}$ (2)
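
Equations (1) and (2) translate directly into code. The following minimal sketch assumes that per-sample and per-batch timings have already been collected in seconds; the function names and the sample timings are hypothetical and are not measurements from this work.

```python
from typing import Sequence


def average_latency(sample_times_s: Sequence[float]) -> float:
    """Eq. (1): mean per-sample inference time at batch size 1."""
    if not sample_times_s:
        raise ValueError("need at least one measurement")
    return sum(sample_times_s) / len(sample_times_s)


def throughput(batch_size: int, batch_latency_s: float) -> float:
    """Eq. (2): samples processed per second for one batch."""
    if batch_latency_s <= 0:
        raise ValueError("batch latency must be positive")
    return batch_size / batch_latency_s


# Hypothetical timings in seconds, used only to exercise the formulas.
latency = average_latency([0.010, 0.012, 0.014])
rate = throughput(batch_size=8, batch_latency_s=0.040)
print(f"latency = {latency * 1e3:.1f} ms, throughput = {rate:.0f} samples/s")
```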

Power. Power is the statistical average power consumption of the DNN processor while completing the inference task. The power drawn while completing the inference of one sample $i$ is denoted by $\Delta W_i$, and the average power consumption over the test set inference process is denoted by $W$ (unit: watt): $W = \frac{1}{N} \sum_{i=1}^{N} \Delta W_i$ (3)

Energy efficiency. For a specific model network, the actual number of operations (ops) per sample is counted as $\text{Ops}$: $\text{Ops} = \text{MACs} \times 2$ (4)

$\text{MACs}$ represents the multiply-add operations in convolutional neural networks. $\text{Energy efficiency} = \frac{\text{Throughput} \times \text{Ops}}{W}$ (5)
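
As a worked illustration of Eqs. (3) to (5), the sketch below computes average power and energy efficiency in operations per second per watt. The MACs value is the ResNet-50 figure from Table 3; the throughput and power readings are hypothetical placeholders, not results from this work, and the function names are assumptions.

```python
from typing import Sequence


def average_power(sample_power_w: Sequence[float]) -> float:
    """Eq. (3): mean power in watts over the test set."""
    if not sample_power_w:
        raise ValueError("need at least one power reading")
    return sum(sample_power_w) / len(sample_power_w)


def ops_from_macs(macs: float) -> float:
    """Eq. (4): one multiply-add counts as two operations."""
    return 2.0 * macs


def energy_efficiency(throughput_sps: float, macs: float, power_w: float) -> float:
    """Eq. (5): operations per second per watt."""
    return throughput_sps * ops_from_macs(macs) / power_w


resnet50_macs = 4.12e9                            # from Table 3
power_w = average_power([48.0, 50.0, 52.0])       # hypothetical readings
efficiency = energy_efficiency(1000.0, resnet50_macs, power_w)  # hypothetical throughput
print(f"{efficiency / 1e9:.1f} GOPS/W")
```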

Accelerator utilization. Accelerator utilization offers a different angle from which to measure how effectively the accelerator’s resources are being used.

We compared the number of instructions of a specific precision (FP32, FP16, or INT8) that the processor actually executed while it was active with the maximum number of instructions of that precision it can theoretically execute during the same time. This determines what percentage of its floating-point or integer capacity (FLOPS, floating-point operations per second, or OPS, operations per second) is utilized: $\text{Accelerator utilization} = \frac{\text{Throughput} \times \text{Ops}}{\text{OPS}_{\text{peak}}}$ (6)
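
Eq. (6) can be sketched in the same style. The peak rate below is a hypothetical placeholder rather than the specification of any device in this work, and the throughput is likewise illustrative; only the MACs value is taken from Table 3.

```python
def accelerator_utilization(
    throughput_sps: float, macs: float, peak_ops_per_s: float
) -> float:
    """Eq. (6): achieved operations per second over the theoretical peak."""
    if peak_ops_per_s <= 0:
        raise ValueError("peak rate must be positive")
    achieved_ops_per_s = throughput_sps * 2.0 * macs  # Ops = MACs * 2
    return achieved_ops_per_s / peak_ops_per_s


# Hypothetical inputs: 1000 samples/s on a device with a 100 TOPS peak.
utilization = accelerator_utilization(1000.0, 4.12e9, 100e12)
print(f"utilization = {utilization:.2%}")
```

The peak rate must be the one for the precision actually executed (FP32, FP16, or INT8); mixing precisions makes the ratio meaningless.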

Accuracy. The most important challenge is to ensure that there is no significant loss in model accuracy after optimization. Table 1 gives an overview of benchmark accuracy metrics with related information, including the three most popular application areas with their representative models and data sets.

Table 1

Overview of benchmark accuracy metrics

Area	Application scenarios	Dataset	Reference implementation model	Accuracy metrics
Vision	Image classification	ImageNet	ResNet/Inception/MobileNet	Prediction accuracy
	Object detection	COCO/VOC	SSD/Mask R-CNN/YOLO	mAP
	Segmentation	VOC	DeepLabv3+/FCN	mIoU
	Super-Resolution	2017CVPR	VGG19/VDSR	PSNR
Language/Audio	Translation	WMT Eng-Germ	Transformer	BLEU
	Speech recognition	LibriSpeech	Deep Speech 2	WER, Perplexity
Commerce	Recommendation	MovieLens-20M	NCF	Prediction accuracy
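
For the image classification row of Table 1, prediction accuracy is commonly reported as top-1 and top-5 accuracy. The following is a minimal sketch of top-k accuracy, assuming one list of class scores per sample; the function name and the toy scores are hypothetical.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]], labels: Sequence[int], k: int = 1
) -> float:
    """Fraction of samples whose true label is among the k highest scores."""
    if len(scores) != len(labels) or not labels:
        raise ValueError("scores and labels must be non-empty and equal in length")
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)


# Toy scores for three samples over three classes (illustrative only).
toy_scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
toy_labels = [1, 1, 0]
top1 = top_k_accuracy(toy_scores, toy_labels, k=1)
top2 = top_k_accuracy(toy_scores, toy_labels, k=2)
print(top1, top2)
```

The accuracy loss of a converted model is then the difference between this value for the original model and for the converted one on the same test set.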

3 Evaluation

3.1 Experimental setup

Our experiments consisted of two parts. The first investigated how specific variables, such as the training framework and the inference framework, affect the results. Following the IBD system, 4 comparison experiments were set up, covering the training framework, different model convert tools, the precision comparison between floating-point and quantized models, and the input data size. The second part, taking all the variables mentioned above into consideration, was a benchmark evaluation using the IBD system on the most commonly used processors for inference.

To obtain representative results, three kinds of devices were included: mobile phone, desktop, and server. Both Linux and Android operating systems were used. The three training frameworks used in this work were TensorFlow, Caffe2, and PyTorch. Typical inference frameworks such as TensorRT, SNPE, HiAI, TFLITE with NNAPI, and MACE were used. The experiments were carried out on processors including CPU, GPU, DSP, and NPU (Neural-network Processing Unit).

The evaluation process was conducted as follows:

Initialization, pre-processing and post-processing times were not considered in the overall processing time.

One warm-up inference run was executed before the actual measurements.

5 consecutive inference iterations were executed and averaged to reduce variance.

The ambient temperature of the experimental device was kept at normal room temperature.

When models were run on mobile devices, the battery charge was kept above 80% and the device was paused for 5 minutes between executions.
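
The timing part of this procedure (one warm-up run, then 5 averaged iterations) can be sketched as follows. The function name `timed_runs` and the injectable `clock` parameter are assumptions made for illustration; temperature and battery conditions have to be controlled outside the code.

```python
import time
from statistics import mean
from typing import Any, Callable, List, Tuple


def timed_runs(
    infer: Callable[[Any], Any],
    sample: Any,
    warmup: int = 1,
    iterations: int = 5,
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, List[float]]:
    """Time `iterations` inference runs after `warmup` untimed runs."""
    for _ in range(warmup):
        infer(sample)  # warm-up: results and timing are discarded
    durations = []
    for _ in range(iterations):
        start = clock()
        infer(sample)
        durations.append(clock() - start)
    return mean(durations), durations


# Hypothetical workload standing in for a real inference call.
mean_seconds, runs = timed_runs(lambda s: sum(s), list(range(1000)))
print(f"mean of {len(runs)} runs: {mean_seconds * 1e6:.1f} us")
```

Only the call to `infer` sits between the two clock reads, which matches the rule that initialization, pre-processing, and post-processing are excluded from the measured time.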

Hardware specifications. Table 2 shows the technical specifications of the 6 devices with corresponding hardware environment in our work.

Table 2
Hardware specifications

	Mobile	Mobile	Mobile	Desktop	Server	Server
processor	Kirin 980	Snapdragon 845	Snapdragon 855	NVIDIA GeForce GTX 1080Ti	NVIDIA Tesla P4	NVIDIA Tesla T4
description	Mobile terminal chip for Mate 20 series	Mobile terminal chip	First mobile platform to collectively commercialize 5G, AI, XR	Desktop Graphics	Built to improve the efficiency of scalable servers to deal with deep learning workloads	The world’s most advanced inference accelerator
process	7nm	10nm	7nm	16nm	16nm	12nm
CPU	2xA76@2.6GHz + 2xA76@1.92GHz + 4xA55@1.8GHz	4xA75@2.45GHz + 4xA53@1.9GHz	Qualcomm® Kryo™ 485 CPU (Octa-core)	Intel Core i7-7700K CPU @4.29 GHz x 8	Intel(R) Xeon(R) Silver 4114 CPU @2.20GHz	Intel(R) Xeon(R) Silver 4114 CPU @2.20GHz
GPU	Mali-G76@720MHz	Adreno 630	Qualcomm® Adreno™ 640 GPU	GeForce GTX 1080 Ti/SSE2	Tesla P4	Tesla T4
interface	USB Type-C	USB Type-C	USB Version 3.1; USB Type-C Support	PCIe 3.0	PCIe 3.0	PCIe 3.0
system	Android	Android	Android	Windows 7-101, Linux, FreeBSDx86	Ubuntu 18/16/14, CentOS7, Windows10	Ubuntu 18/16/14, CentOS7, Windows10
supported framework	HiAI, AndroidNN	SNPE	SNPE	PyTorch, TensorFlow, TensorRT	PyTorch, TensorFlow, TensorRT	PyTorch, TensorFlow, TensorRT

Models. The models should cover the application scenarios and reflect the metrics on different hardware with limited resources. The 10 models used, from three different application scenarios, are listed in Table 3.

Table 3

Models used for evaluation

Application	Test data	Model	Input size	MACs(GOPs)	Training framework
Image classification	ImageNet	ResNet-50[21]	224x224x3	4.12	TensorFlow/Caffe2
		ResNet-101[21]	224x224x3	7.85	Caffe2/PyTorch
		ResNet-152[21]	224x224x3	11.58	Caffe2
		Inception-v3[22]	299x299x3	5.73	TensorFlow/Caffe2/PyTorch
		MobileNet-v1[23]	224x224x3	0.3	Caffe2
		MobileNet-v2[23]	224x224x3	0.57	TensorFlow/Caffe2/PyTorch
		VGG16[24]	224x224x3	15.5	Caffe2
Object detection	VOC2007	ssd_mobilenetv1[25]	300x300x3	1.55	Caffe2
		ssd_vgg16[26]	300x300x3	31.44	Caffe2
Super-Resolution	2017CVPR DIV2K X4	vdsr[27]	256x256	43.64	Caffe2

3.2 Experimental results

3.2.1 Training framework

One mobile-side processor, Snapdragon 845, and one cloud-side processor, NVIDIA Tesla T4, were used in this experiment. Three of the most popular models were chosen, namely Inception_v3, ResNet50, and MobileNet_v1, each trained with Caffe2 and with TensorFlow. With the same input size, batch size (bs = 1), and processor with its specific inference framework, Inception_v3 was run on the Snapdragon 845, and ResNet50 and MobileNet_v1 were run on the Tesla T4. As shown in Fig. 2, with the same model architecture and hardware platform, the performance difference between the Caffe2 and TensorFlow models reached 6.917 samples per second on the Snapdragon 845, and factors of 1.9 and 2.3 on the T4.

Fig. 2

Comparison of results for different training frameworks.

The ImageNet validation data set was used to calculate the accuracy. According to the top-1 and top-5 accuracy loss results shown in Fig. 2(b), all losses were less than 4%.

Observation 1: The training framework of the input model does influence the performance of the inference platform, even when the model architecture is the same. We found that most benchmark suites consider only the applications and models as the most important workload factors and neglect differences between training frameworks. Our work shows that the characteristics of the training framework can also influence the benchmark result.

3.2.2 Input size influence

Figure 3 shows how the input data size influenced performance on the desktop processor GeForce GTX 1080Ti. The input size was varied from 128x128 to 1024x1024. Three representative PyTorch networks were used for the experiment: Inception_v3 [28], MobileNet_v2 [29], and ResNet101 [30]. Performance decreased as the input size increased for all three models.

Fig. 3

Input size influence result.

Observation 2: Performance decreased as the input size increased for all models, but the size of the decline differed with the model architecture and with the processor memory and bandwidth. As expected, the larger the input size, the lower the throughput for all models. We conclude that, to choose the most suitable input size, one should consider the model architecture and the deployment type.
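
One contributing factor to this trend is that the arithmetic cost of a convolution layer grows with the area of its feature map. The sketch below counts the MACs of a single convolution layer, assuming 'same' padding and a hypothetical layer shape (64 input channels, 64 output channels, 3x3 kernel); it does not model memory capacity or bandwidth, which the observation above also names as factors.

```python
import math


def conv2d_macs(
    height: int,
    width: int,
    in_channels: int,
    out_channels: int,
    kernel: int = 3,
    stride: int = 1,
) -> int:
    """Multiply-accumulate count of one 'same'-padded 2D convolution layer."""
    out_h = math.ceil(height / stride)
    out_w = math.ceil(width / stride)
    return out_h * out_w * in_channels * out_channels * kernel * kernel


base = conv2d_macs(128, 128, 64, 64)
for side in (128, 256, 512, 1024):
    ratio = conv2d_macs(side, side, 64, 64) / base
    print(f"{side}x{side}: {ratio:.0f}x the MACs of 128x128")
```

Doubling the side length quadruples the MACs of such a layer, so going from 128x128 to 1024x1024 multiplies the per-layer work by 64.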

3.2.3 Inference framework

To explore the performance of the model convert tools and the corresponding inference frameworks published by the respective companies and vendors, we chose Kirin 980 and Snapdragon 855 as the test platforms for this experiment. Inception_v3 [22] with input size 299x299, MobileNet_v2 [23] with input size 224x224, and ResNet101 [21] with input size 224x224 were used. Fig. 4 shows the results of four comparison experiments covering 5 different inference framework configurations: HiAI, SNPE, MACE, and TFLITE with and without NNAPI. 1k images from the ImageNet validation data set were used to calculate the accuracy. With the reasonable accuracy loss shown in Fig. 4(b), the same model gave hugely different performance results with different inference frameworks and runtimes, which shows that both the inference framework and the runtime can have a large influence on inference performance.

Fig. 4

Comparison of results for different inference frameworks.

Observation 3: Inference framework diversity is an important factor when comparing the performance of different DNN processors. We found that the measured performance of a model on a specific processor can vary greatly between inference frameworks, and hence the choice of inference framework matters in any comparison of DNN processors. For example, the performance of these 4 models with different inference frameworks could differ by a factor of almost 7 to 40.

3.2.4 Floating-point and Quantized Inference

Figure 5 shows the performance of floating-point and quantized models on the Kirin 980. The two precision types showed nearly the same accuracy for all 7 models mentioned, as shown in Fig. 5(a), and the INT8 models ran at approximately double the speed of the FP16 models, as shown in Fig. 5(b).

Fig. 5

Floating-point and quantized models performance.

Observation 4: For specific models, INT8 inference is becoming faster, and the accuracy gap between INT8 and FP16 inference is narrowing. However, because the two model types have different properties and show different accuracy results, numbers obtained by mixing them make no sense: one should avoid comparing the performance of two different devices by running floating-point models on one and quantized models on the other.

3.2.5 Benchmark result

Following the IBD system, our benchmark analysis focused on a set of key metrics: latency, throughput, power, actual energy efficiency, and accelerator utilization. Figure 6 shows the benchmark results using NVIDIA Tesla P4 and T4 as the experimental platforms. According to the accuracy loss shown in Fig. 7, all losses were within 4%.

Fig. 6

Benchmark result of T4&P4.

Fig. 7

Accuracy loss T4&P4.

Observation 5: With an increase in the batch size for all models, latency decreased while the execution time and throughput increased. The T4 delivered up to 2 to 4 times better throughput. As inference is a latency- and energy-sensitive task, throughput is not the only primary performance metric of concern. There is a trade-off between latency and throughput, and the applicability of each approach depends on the particular task and the corresponding hardware and energy consumption limitations.

Observation 6: ResNet50, ResNet152, and VGG16 showed equal energy efficiency, and utilization increased with model complexity. The MACs of these three models were 4.12 GOPs, 11.58 GOPs, and 15.5 GOPs. Model diversity led to different throughput but did not have a big influence on energy efficiency, although the lightweight model MobileNet_v1 did show a different result.

Four models with diverse characteristics, listed in Table 4, are evaluated. Each one presents a unique design trade-off between accuracy, parameters, and computational complexity (MACs) with the same input size and data set.

Table 4

Comparison of the four evaluated models

Model	MACs	Top-1 accuracy	Parameters	Input size	Layers
ResNet50	4.12 GOPs	75.3%	25.5 M	224x224x3	50
ResNet152	11.58 GOPs	77%	60.4 M	224x224x3	152
VGG16	15.5 GOPs	71.9%	138.3 M	224x224x3	16
MobileNet_v1	0.3 GOPs	70.81%	4.2 M	224x224x3	19

Observation 7: With an increase in the batch size for all models, utilization decreased. MobileNet_v1 had low GPU utilization: even with the maximum batch size, its GPU INT8 utilization was much lower than that of the other three models.

4 Discussions & conclusion

This is the first available study that focuses on an inference benchmark for DNNs across three deployment types: cloud, desktop, and mobile. We propose the IBD system, which aims to identify hardware bottlenecks and to give a more objective and useful comparison of metrics. The analysis can be summarized as follows:

Workload. The experimental results show that workload diversity should cover differences in the training framework, the inference framework on a specific processor, the input size, and the precision (floating-point versus quantized). For DL models deployed on a target device, the real execution time includes the pre-processing, inference, and post-processing times. Figure 8 shows the total execution time of Inception_v3 on two platforms. In model deployment, pre-processing takes almost half of the execution time, which means that pre-processing, which usually runs on the CPU, can be the bottleneck of real performance. The applicability of each workload depends on the particular task and the corresponding deployment type.

Fig. 8

Analysis of the real execution time of models.

Metrics. We have provided a thorough comparison of the platforms with different inference frameworks and find that each performs differently for some types of models. For IBD, the relevant metrics are accuracy, latency, throughput, power, energy efficiency, and accelerator utilization. Latency and execution time mainly reflect the inference speed, while throughput reflects the amount of data processed per second. Because processing more data per second often leads to a longer execution time, these two kinds of metrics trade off against each other, and the right balance depends on the demands of the particular application. Energy efficiency measures operations per second per watt. This metric mainly reflects the relationship between a particular task and the corresponding hardware and energy consumption limitations. Accelerator utilization can reveal which other resource, such as the CPU or data communication, limits throughput; further improvement can be achieved by overlapping CPU runtime or data communication with GPU execution. In short, all these metrics show that a benchmark for DNN processors should rest on a comprehensive analysis of factors rather than on high performance alone. For DNN processors, there will always be a most suitable one, but not a single best one.

In this paper, we propose a new benchmark suite for DNN inference, called IBD, which covers a wide range of inference deployment platforms from mobile and desktop to the cloud side. IBD consists of 10 state-of-the-art DNN models implemented in major training frameworks such as TensorFlow, Caffe2, and PyTorch. We use these models to perform extensive performance analysis and profiling to shed light on the efficiency of DNN inference for different hardware configurations. We propose the components and workflow of the IBD system, which give a more objective and useful comparison of metrics in line with real market demand. Additionally, the results of four related comparison experiments comprehensively reflect the current state of DNN processor deployment performance and provide a deep analysis of the factors that affect the benchmark results. The benchmark results also analyze the real performance of DNN processors in 6 metrics from different perspectives. Using our methodologies and metrics, we draw several important observations and recommendations on where future research and optimization of DNN model deployment should be focused. We hope that our IBD benchmark suite, tools, methodologies, and observations will be useful for a large number of ML developers and systems designers in making their DNN inference process efficient.

References

1. Jia, Maggioni, Smith, et al., Dissecting the NVidia Turing T4 GPU via Microbenchmarking[J], 2019.

2. xilinx[EB/OL].[2030-3-22]. https://www.xilinx.com/support/documentation/ip_documentation/dpu/v3_1/pg338-dpu.pdf

3. Shah, Chaudhari and Varghese, Runtime Programmable and Memory Bandwidth Optimized FPGA-Based Coprocessor for Deep Convolutional Neural Network[J], IEEE Transactions on Neural Networks & Learning Systems, 2018:1–13.

4. Zhang, Du, Zhang, et al., Cambricon-X: An accelerator for sparse neural networks[C]// IEEE/ACM International Symposium on Microarchitecture. ACM, 2016.

5. Kumar, Bitorff, Chen, et al., Scale MLPerf-0.6 models on Google TPU-v3 Pods[J], 2019.

6. Jinook, Yunkyo, Seok Jun-P., et al., 7.1 an 11.5 tops/w 1024-mac butterfly structure dual-core sparsity-aware neural processing unit in 8nm flagship mobile soc. In 2019 IEEE International Solid-State Circuits Conference (ISSCC), pages 130–132. IEEE, 2019.

7. Ignatov, Timofte, Kulik, et al., AI Benchmark: All About Deep Learning on Smartphones in 2019[J], 2019.

8. Frajberg, Fraternali and Torres R.N., Convolutional neural network for pixelwise skyline detection. In: International Conference on Artificial Neural Networks. pp. 1220. Springer (2017)

9. Migacz, Szymon, NVIDIA 8-bit inference with TensorRT, GPU Technology Conference, 2017.

10. Chen, Tianshi, et al., BenchNN: On the broad potential application scope of hardware neural network accelerators, 2012 IEEE International Symposium on Workload Characterization (IISWC), IEEE, 2012.

11. Narang, DeepBench, https://github.com/baidu-research/DeepBench, 2016.

12. Coleman, Cody, et al., Dawnbench: An end-to-end deep learning benchmark and competition, Training 100.101 (2017), 102.

13. Wei, Xu, Jin, et al., AI Matrix – Synthetic Benchmarks for DNN[J], 2018.

14. Tao J.H., Du Z.D., Guo, et al., BenchIP: Benchmarking Intelligence Processors[J], Journal of Computer Science and Technology, 2018, 33(1):1–23.

15. Zhu, Hongyu, et al., Tbd: Benchmarking and analyzing deep neural network training, arXiv preprint arXiv:1803.06905 (2018).

16. Wang Y.E., Wei G.Y. and Brooks, Benchmarking TPU, GPU, and CPU Platforms for Deep Learning[J], 2019.

17. Mlperf: Fair and useful benchmarks for measuring training and inference performance of ml hardware, software, and services, http://mlperf.org.

18. XiaoMi.MACE. https://github.com/XiaoMi/mace

19. Hanhirova, Jussi, et al., Latency and throughput characterization of convolutional neural networks for mobile computer vision, Proceedings of the 9th ACM Multimedia Systems Conference, 2018.

20. Cho H.D., Engineer P.D.P., Chung and Kim, Benefits of the big.LITTLE architecture, EETimes, Feb (2012).

21. He, Zhang, Ren, et al., Deep Residual Learning for Image Recognition[C], IEEE Conference on Computer Vision & Pattern Recognition, IEEE Computer Society, 2016.

22. TensorFlow-Slim image classification model library, N. Silberman and S. Guadarrama, 2016. https://github.com/tensorflow/models/tree/master/research/slim

23. Howard A.G., Zhu, Chen, et al., MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications[J], 2017.

24. Simonyan, Karen and Zisserman, Andrew, Very Deep Convolutional Networks for Large-Scale Image Recognition[J], Computer Science, 2014.

25. Google.model[EB/OL]. https://storage.googleapis.com/models-hao/mobilenet-v1-ssd-mp-0_675.pth

26. Google.model[EB/OL]. https://storage.googleapis.com/models-hao/vgg16-ssd-mp-0_7726.pth

27. Kim, Lee J.K. and Lee K.M., Accurate Image Super-Resolution Using Very Deep Convolutional Networks[C], 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2016.

28. Pytorch.models[EB/OL]. https://download.Pytorch.org/models/inception_v3_google-1a9a5a14.pth

29. Pytorch.models[EB/OL]. https://download.Pytorch.org/models/mobilenet_v2-b0353104.pth

30. Pytorch.models[EB/OL]. https://download.Pytorch.org/models/ResNet101-5d3b4d8f.pth