Abstract
In computer vision applications, the limited on-chip memory of the Field Programmable Gate Array (FPGA) makes it difficult to meet power, size and other requirements. To address this problem, the study analyses memory resource allocation, overall power consumption and resource utilisation from the perspective of image processing technology and constructs a partitioning algorithm that balances energy consumption against resource utilisation. The balancing algorithm consumes less power than the optimised utilisation algorithm and the High Level Synthesis (HLS) tool; the Block Random Access Memory (BRAM) power consumption takes the value 0.005 in both cases, and the dynamic power consumption lies in the range 0.014–0.082. Compared with the HLS tool, the overall power consumption of the balancing algorithm and the optimised utilisation algorithm is significantly lower, at 0.251 and 0.252 respectively, a reduction of approximately 30% in both cases. The proposed memory optimisation allocation algorithm achieves the highest accuracy among the four memory optimisation allocation algorithms and strategies at all three target scales. The FPGA memory optimisation allocation strategy achieves lower power consumption at the same resource occupancy, and the model has application value in visual image processing technology.
Introduction
Computer vision is used in unmanned supermarkets, face recognition, driverless cars and similar applications, but these scenarios are constrained by power consumption and cost, and it is difficult to complete the image data processing on a traditional general-purpose processor platform. The field programmable gate array (FPGA) addresses this problem: it is an integrated circuit, also known as flowing solid, whose hardware circuits can be configured according to the different needs of designers and users [1, 2]. The FPGA design process mainly includes design input, synthesis, design implementation, verification and device configuration, and the device is characterised by high reliability, high integration and high speed, which gives it obvious advantages in image processing. However, the on-chip memory resource of an FPGA is only about 6 MB, so images are usually stored in external memory. Off-chip memory has high access latency, which greatly reduces performance [3, 4], and external memory also increases the cost, power consumption and size of the system. Current research on computer vision applications at home and abroad focuses on reducing power consumption or improving performance, but the results address only one of the two, the two aspects are rarely analysed together, and an optimal solution under balanced conditions has not been achieved. To address these problems, the study proposes to achieve efficient image processing through the CAMShift (Continuously Adaptive Mean Shift) algorithm and gives a balanced partition algorithm for FPGA on-chip memory that exploits the reliability, integration and speed of the FPGA in order to minimize both the power consumption of image processing applications and the occupation of FPGA memory resources.
The remainder of the paper consists of four parts. The second part summarizes recent research by domestic and foreign scholars on the optimization of FPGA memory allocation and image processing efficiency. The third part presents the FPGA memory balance partition algorithm and the CAMShift algorithm in image processing technology. The fourth part verifies the image processing technology through simulation experiments and analyzes the results. The fifth part summarizes the research results and discusses the remaining problems and future prospects.
Review of the literature
Park et al. [5] proposed an FPGA-based hardware accelerator for handwritten Korean character recognition, motivated by the difficulty of running deep convolutional neural networks in real time in software. Using memory access optimization, computational unit parallelism and data conversion, the Xilinx FPGA accelerator achieved a recognition time of 11.19 ms per character, which greatly improves recognition efficiency and also offers significant advantages in energy efficiency. Lai et al. [6] studied the impact of FPGA memory partitioning on power consumption and used data placement to achieve parallel access, thereby minimizing communication power consumption. Chetan et al. [7], noting the importance of matrices in quantum mechanics, image processing and circuit solving, used model-based FPGA design to design and evaluate different floating-point matrix multiplication architectures and floating-point matrix inversion methods. The model was implemented in hardware on a ZED board with a Zynq 7000 FPGA, and the performance test results showed that the approach is highly feasible. Wu et al. [8] addressed the need for multi-FPGA acceleration and proposed a database for efficient large-scale graphics processing in FPGA-accelerated data centres that requires minimal hardware engineering effort and is described as a state-of-the-art graphics accelerator. Results on a 32-node Microsoft Catapult-type data centre show that the proposed communication library has strong data processing capabilities in a central data processor and is highly scalable. Yan et al. [9] designed an FPGA accelerator for graphics attention networks that eliminates the need for digital signal processors and large amounts of on-chip memory. The performance evaluation was completed on a Wave F10A motherboard with an Intel Arria 10 GX1150 and 16 GB DDR3 memory, and the test results showed that the proposed accelerator runs faster and consumes less energy in the central data processor.
Wang et al. [10] designed two harmonic summation methods to address the irregularity and low continuity of FPGA accesses to a large number of points in off-chip memory, namely processing the input signals directly without storage and storing the intermediate data in the off-chip memory, and implemented them in an open computing language. The proposed preloading method, which preloads all necessary points while reordering the input signals, is faster than all other methods and has lower energy consumption. Menasri et al. [11] used parallel and pipelined structures to improve the throughput of the CABAC decoder and proposed an architecture for implementing the CABAC decoder on FPGAs. Test results showed that the implementation could handle 2.2 bins/cycle and reach a throughput of 271.678 Mbins/s when operating at 123.49 MHz. Jiang et al. [12] proposed a method for implementing peak detection on FPGAs that explores parallel architectures to accelerate peak detection and achieve system miniaturisation. The test results showed that, compared with the peak detection module using the Caruana method, the proposed method reduced the peak detection error by 42.81% and the ranging error by 63.63% while ensuring real-time updates of the ranging results. Bao et al. [13] designed an instruction-driven convolutional neural network accelerator based on the FPGA, a platform widely used to accelerate convolutional neural networks because of its good performance, energy efficiency and flexibility, and applied it to image target detection. The performance of the accelerator and its scheduling strategy was evaluated on a Xilinx KU115 FPGA platform. The framework achieved 2.3 times and 50 times energy efficiency respectively, and the accuracy of the fixed-point detection algorithm decreased by less than 1% compared with the floating-point algorithm. Zhao et al. [14] reorganized the memory layout at the algorithm level and, by examining the calculation behavior, placed data through lattice-based partitions accordingly.
The work of these domestic and foreign scholars shows that research on image processing technology mainly focuses on image data analysis. Although some studies involve FPGA memory optimisation allocation, they have made little progress on it, and they analyse resource utilisation and energy consumption only in isolation. Building on these results, the study proposes a partitioning algorithm for FPGA memory optimisation allocation that aims at efficient resource utilisation and lower energy consumption of FPGA memory, which in turn enhances the effectiveness of image processing techniques.
FPGA memory balancing partitioning algorithms in image processing technology
CAMShift algorithm in advanced image processing techniques
With the rapid development of computer vision and image processing technology, target tracking and detection in video images have become increasingly important. Target tracking methods fall mainly into generative and discriminative methods. The generative methods mainly include mean shift, Kalman filter, particle filter and sparse coding. The CAMShift algorithm is the most commonly used mean shift target tracking algorithm and has the following characteristics: its algorithmic complexity is low; it is a parameterless algorithm that is easy to integrate with other algorithms; and it uses weighted histogram modeling, which is insensitive to small-angle rotation, slight deformation and partial occlusion of the target. The CAMShift algorithm is an adaptive tracking algorithm, which obtains the confidence of the object position according to the color histogram of the object in the previous image. When the object position has been determined and the color histogram in the
Schematic diagram of mean shift principle.
The mean shift algorithm marks a target object by the shape and size of the region of interest alone. The feature space consists of all attributes considered during image processing; after the image attributes have been mapped, the feature space is a dense set of data points made up of the significant feature attributes. The principle of mean shift is shown in Fig. 1. First, the initial window position, shape and centre of mass are determined. Then a new window shift vector is obtained using the mean shift algorithm. Finally, the mean shift vector is adjusted repeatedly until the probability density function value is maximised. The multivariate density kernel function
In Eq. (3),
The updated centre-of-mass position is obtained by combining the shift vector with the initial centre-of-mass position. The principle of RGB differencing is to determine the detection area of a target based on the difference between the luminance values of pixels in the shadow and non-shadow regions of RGB space [15, 16]. R, G and B refer to the colours red, green and blue, respectively, and the digital image is composed of grey levels in these three channels. Figure 2a and b show the RGB colour cube diagram and the pixel points of the RGB colour image, respectively. The HSV colour model can also be referred to as the hexagonal cone model, with the colour parameters being hue, saturation and value (lightness). This model is widely used in image editing.
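The mean shift procedure described earlier in this subsection (fix an initial window, compute the centre of mass of the weights inside it, move the window there, repeat until the window stops moving) can be sketched as follows. This is a minimal illustration with a flat kernel over a 2D weight map such as a colour-histogram back-projection. The function name `mean_shift`, the `(x, y, w, h)` window representation and the iteration limit are assumptions made for the example, not the paper's implementation, and the window-size adaptation that distinguishes CAMShift is omitted.

```python
import math


def mean_shift(weights, window, max_iter=20):
    """Move a rectangular window towards a local maximum of a 2D weight map.

    weights: list of rows of non-negative numbers (hypothetical back-projection).
    window:  (x, y, w, h), with (x, y) the top-left corner in pixels.
    """
    rows, cols = len(weights), len(weights[0])
    x, y, w, h = window
    for _ in range(max_iter):
        # Zeroth and first moments of the weights inside the window.
        m00 = m10 = m01 = 0.0
        for r in range(max(0, y), min(rows, y + h)):
            for c in range(max(0, x), min(cols, x + w)):
                v = weights[r][c]
                m00 += v
                m10 += c * v
                m01 += r * v
        if m00 == 0:
            break  # nothing to track inside the window
        # Centre of mass, then the window position centred on it.
        cx, cy = m10 / m00, m01 / m00
        new_x = math.floor(cx - (w - 1) / 2 + 0.5)
        new_y = math.floor(cy - (h - 1) / 2 + 0.5)
        if (new_x, new_y) == (x, y):
            break  # shift vector is zero: converged
        x, y = new_x, new_y
    return x, y, w, h
```

Each pass recomputes the centre of mass from the image moments, so the window climbs the weight map one shift vector at a time; a full CAMShift tracker would additionally rescale the window from the zeroth moment after convergence.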
RGB color cube diagram and pixel diagram.
The formula for converting an RGB space image to HSV colour space is Eq. (4).
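As an illustration of the conversion, the sketch below implements the commonly used hexcone RGB-to-HSV formula for 8-bit channels and compares it with Python's standard `colorsys` module. The body of Eq. (4) is not reproduced in this text, so it is an assumption that the paper uses this standard form; the function name `rgb_to_hsv` and the choice of hue in degrees are also illustrative.

```python
import colorsys


def rgb_to_hsv(r, g, b):
    """Convert 8-bit RGB to (hue in degrees, saturation, value)."""
    r, g, b = r / 255.0, g / 255.0, b / 255.0
    v = max(r, g, b)              # value: the largest channel
    delta = v - min(r, g, b)      # chroma
    s = 0.0 if v == 0 else delta / v
    if delta == 0:
        h = 0.0                   # grey pixel: hue is undefined, use 0
    elif v == r:
        h = 60.0 * (((g - b) / delta) % 6)
    elif v == g:
        h = 60.0 * ((b - r) / delta + 2)
    else:
        h = 60.0 * ((r - g) / delta + 4)
    return h, s, v


# Cross-check one pixel against the standard library (hue scaled to [0, 1]).
hue, sat, val = rgb_to_hsv(200, 100, 50)
ref_h, ref_s, ref_v = colorsys.rgb_to_hsv(200 / 255, 100 / 255, 50 / 255)
```

Tracking on the hue channel is what makes a histogram-based tracker relatively insensitive to brightness changes, which is the usual reason for converting to HSV before building the colour histogram.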
Equation (4),
In Eq. (5), the total number of possible grey levels in the image is referred to by
In Eq. (6),
To achieve efficient image data processing with lower energy consumption and lower resource occupation, an FPGA memory balance partition algorithm is proposed. The partitioning algorithm first analyses resource utilisation and energy consumption in image processing, and then presents a design approach for balancing utilisation and power consumption. For the resource utilisation problem in image processing, each image frame buffer is set to use only one possible BRAM (Block Random Access Memory) configuration. The study gives four related definitions, as follows. For the possible configurations of the BRAM
In Eq. (7), the first and second components of each element represent the width
A frame image of any size has several possible BRAM topologies. Although the total physical capacity is roughly the same, in some configurations additional data bits can be used as parity bits. The image frame is treated as a three-dimensional array in which dimensions 1 and 2 represent the width and height respectively, and dimension 3 refers to the pixel bit width. The mapping of the BRAM topology from a 3D to a 2D array is shown in Fig. 3.
Mapping based on 3D to 2D array.
The main sequence of the mapping scheme used in the study is rows and columns. Set
The expression for a two-dimensional array is Eq. (10).
In Eq. (10), the integer division and modulus are represented by
In Eq. (11), linear combinations are referred to by the designator, with one and only one possessing a non-zero component.
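The row-major mapping and the integer division and modulus it relies on can be sketched as below. The equation bodies are not reproduced in this text, so the exact index formulas are an assumption: a pixel is flattened row by row, the flat address is split by the BRAM depth, and the bit index is split by the BRAM data width. The function name `map_pixel_bit` and its parameter names are hypothetical.

```python
def map_pixel_bit(x, y, bit, frame_width, bram_depth, bram_width):
    """Locate one bit of pixel (x, y) in a grid of identically configured BRAMs.

    Returns (bram_row, bram_col, address, lane):
      bram_row, bram_col  which BRAM of the grid holds the bit
      address             word address inside that BRAM
      lane                bit position inside the addressed word
    """
    linear = y * frame_width + x                      # row-major flattening
    bram_row, address = divmod(linear, bram_depth)    # integer division, modulus
    bram_col, lane = divmod(bit, bram_width)
    return bram_row, bram_col, address, lane


# Last pixel of a 320 x 240 frame, most significant of 8 bits,
# for a 2K x 9 and a 16K x 1 BRAM configuration.
wide = map_pixel_bit(319, 239, 7, 320, 2048, 9)
narrow = map_pixel_bit(319, 239, 7, 320, 16384, 1)
```

With the wide configuration all eight bits of a pixel fall into the same BRAM, whereas with the 1-bit configuration each bit lands in a different BRAM column, which is the property the later power discussion depends on.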
Schematic diagram of 2D array mapping to Bram.
A two-dimensional array of size
The efficiency problem can be understood as maximising the utilisation of the partitioning scheme. For a scale width and height of
Equation (13) is stated for the 18K-bit BRAM of the Xilinx Virtex 7 series FPGA and involves the total capacity of the BRAM. The expression for the partition scheme is Eq. (14).
In Eq. (14), the partitioning scheme occupies 64 BRAMs and has a storage efficiency of 0.5208333. This is the default method for High Level Synthesis (HLS), the configuration is
The upward rounding result of the 2-log operation in Eq. (15) is
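The utilisation figures quoted for Eq. (14) can be checked with a short calculation. The sketch assumes a 320 × 240 frame with 8-bit pixels and the aspect ratios commonly listed for a Xilinx 7-series 18K-bit BRAM (the 512 × 36 ratio applies to simple dual-port mode); neither is stated explicitly in the surviving text, but together they reproduce the 64 BRAMs and the storage efficiency of 0.5208333 given above. The function names are illustrative.

```python
import math

BRAM18_BITS = 18 * 1024  # capacity of one 18K-bit BRAM, parity bits included

# (depth, width) aspect ratios assumed for a 7-series 18K-bit BRAM
CONFIGS = [(16384, 1), (8192, 2), (4096, 4), (2048, 9), (1024, 18), (512, 36)]


def bram_count(words, bits, depth, width, pad_pow2=False):
    """Number of BRAMs needed for `words` entries of `bits` bits each."""
    if pad_pow2:
        # Depth rounded up to a power of two (ceiling of log2), as in the
        # HLS default described in the text.
        words = 1 << (words - 1).bit_length()
    return math.ceil(words / depth) * math.ceil(bits / width)


def efficiency(words, bits, n_brams):
    """Fraction of the allocated BRAM capacity that holds frame data."""
    return words * bits / (n_brams * BRAM18_BITS)


def best_utilisation(words, bits):
    """Exhaustive search for the configuration using the fewest BRAMs."""
    return min(CONFIGS, key=lambda cfg: bram_count(words, bits, *cfg))


words, bits = 320 * 240, 8
hls_brams = bram_count(words, bits, 16384, 1, pad_pow2=True)   # 64
hls_eff = efficiency(words, bits, hls_brams)                   # 0.5208333...
best = best_utilisation(words, bits)
best_brams = bram_count(words, bits, *best)                    # 38
```

Under these assumptions, dropping the power-of-two padding and choosing the aspect ratio by search reduces the buffer from 64 to 38 BRAMs, raising the storage efficiency from about 0.52 to about 0.88.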
Regarding the energy consumption of image processing, the static power consumption of the BRAM is proportional to its utilization ratio, while the dynamic power consumption of the BRAM is generated by a series of operations. A read proceeds as follows: the clock signal is gated, the address is decoded, the data is routed to the multiplexer, and the data is transferred to the external port of the BRAM [19, 20]. A write proceeds as follows: the clock signal is gated, the data is transferred to the write buffer, the address is decoded, and the data is stored in the random access memory (RAM) unit. In the partition presented in Eq. (14), each data item is distributed among 8 BRAMs. The power consumption of the read and write operations is reduced when the partitioning is optimised by choosing a suitable configuration scheme. The partition can therefore be optimised by using the widest BRAM configuration that is a multiple of Bw, but increasing the number of BRAMs increases the static power consumption as well as the chip and address power consumption. The study therefore needs to balance utilisation and power consumption. Firstly, an algorithm that optimises utilisation is required (the optimised utilisation algorithm). Secondly, utilisation and power consumption must be balanced (the balancing algorithm). Comparing the efficiency of each configuration with the maximum efficiency is an extremely limited search, and Table 1 refers to the classical configuration, set 160 at the same time
Configuration parameters for utilization optimization
The study sets the balance between utilization and energy consumption according to user-defined trade-offs, empirical power attributes, and power and space heuristics. The balancing algorithm takes the utilization-optimal solution as its starting point and searches for a BRAM configuration by iteration. When it finds the first solution below the threshold, it stops the configuration search and returns the last solution above the threshold.
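The stopping rule of the balancing algorithm can be sketched as follows. The paper's threshold combines user-defined trade-offs with power heuristics that are not given in this text, so the sketch substitutes a simpler, assumed criterion: a candidate is acceptable while its storage efficiency stays above a fraction `tradeoff` of the best achievable efficiency. The aspect ratios, the 320 × 240 × 8-bit frame and all names are assumptions made for the example.

```python
import math

BRAM18_BITS = 18 * 1024  # capacity of one 18K-bit BRAM, parity bits included

# (depth, width) aspect ratios assumed for an 18K-bit BRAM, narrowest first
CONFIGS = [(16384, 1), (8192, 2), (4096, 4), (2048, 9), (1024, 18), (512, 36)]


def efficiency(words, bits, depth, width):
    """Storage efficiency of a frame buffer built from one configuration."""
    n_brams = math.ceil(words / depth) * math.ceil(bits / width)
    return words * bits / (n_brams * BRAM18_BITS)


def balanced_config(words, bits, tradeoff=0.9):
    """Widen the configuration until efficiency drops below the threshold."""
    # Step 1: start from the utilisation-optimal configuration.
    start = max(range(len(CONFIGS)),
                key=lambda i: efficiency(words, bits, *CONFIGS[i]))
    threshold = tradeoff * efficiency(words, bits, *CONFIGS[start])
    chosen = CONFIGS[start]
    # Step 2: iterate over progressively wider configurations.
    for cfg in CONFIGS[start + 1:]:
        if efficiency(words, bits, *cfg) < threshold:
            break            # first solution below the threshold: stop
        chosen = cfg         # last solution still above the threshold
    return chosen


strict = balanced_config(320 * 240, 8)                  # keeps efficiency high
relaxed = balanced_config(320 * 240, 8, tradeoff=0.5)   # accepts wider words
```

A wider word means fewer BRAMs are activated per pixel access, which is why the search moves towards wider configurations; lowering `tradeoff` in this sketch trades storage efficiency for that reduction.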
Configuration parameters obtained by balancing utilization and power consumption algorithm
Test results of BRAM and dynamic power consumption in sliding window.
To analyze the effect of the FPGA memory optimization configuration scheme in image processing technology, the hardware used in the research is a Xilinx Virtex 7 FPGA, and the power characterization tool is the Xilinx Power Estimator. The study measures the power consumption and utilization of the frame buffer generated by the partitioning algorithm and compares them with the corresponding values of equivalent frame buffers produced by other FPGA memory optimization algorithms in image processing technology, in order to verify the performance of the proposed FPGA memory optimization allocation partitioning algorithm. Firstly, the power consumption of each system is estimated by the Xilinx Power Estimator. In the sliding window experiment, use 3
Resource usage of different memory optimization algorithms/number
Power consumption comparison results of all versions.
Table 2 refers to the configuration parameters obtained by the algorithm for balancing utilisation and power consumption. In comparison to the parameters obtained by the optimised utilisation algorithm, the width of most of the configuration parameters will be increased, which will reduce the power consumption, both in terms of dynamic power consumption and BRAM power consumption.
In order to analyze the energy consumption of different algorithms in the CAMShift algorithm of image processing technology, the size of frame image is set to 320
Accuracy comparison results under different frame image pixels.
Variation rules of classification accuracy and number of image sequences under different SNR.
Table 3 reports the resource usage of the different memory optimisation configuration algorithms in terms of flip-flops (FFs), look-up tables (LUTs), input/output pins (IOs), digital signal processors (DSPs) and BRAMs. The comparison algorithms are the FPGA dual-port memory mapping optimization algorithm (Method 1) and the FPGA triple-addressable memory acceleration update mechanism (Method 2). As can be seen from the table, the partitioning algorithm has a higher resource usage than the other two memory optimisation allocation algorithms, which may be closely related to the number of BRAMs used. The number of BRAMs used by the HLS tool is about 1.5 times that of the partitioning algorithm, whereas the partitioning algorithm's usage of IOs, DSPs, FFs and LUTs is not affected.
Experiments were conducted to compare the accuracy at different frame image pixels. The results of the HLS tool, the partitioning algorithm, the FPGA dual-port memory mapping optimisation algorithm (Method 1) and the FPGA triple-addressable memory accelerated update mechanism (Method 2) are shown in Fig. 7a–d, respectively. For the HLS tool, the best image frame pixels for the general, small and minimum target scales are 224 dpi, 278 dpi and 254 dpi respectively, and the accuracy rates obtained are 95.8%, 86.7% and 87.5% in that order. From Fig. 7b, the best frame pixels of the partitioning algorithm for the general, small and minimum target scales are 286 dpi, 258 dpi and 246 dpi, and the accuracy rates obtained are 99.1%, 86.7% and 88.6% respectively. For Methods 1 and 2, the best image frame pixels for all three target scales are around 250 dpi, but the accuracy rates of both algorithms are lower, in the range of 70.9% to 83.5%. Therefore, the memory optimisation allocation algorithm proposed in the study has the highest accuracy rate among the four memory optimisation allocation algorithms and strategies for all three types of target scales. Compared with Methods 1 and 2, the partitioning algorithm shows its largest improvement in accuracy rate at the smallest target scale, with an improvement of nearly 40%. The algorithm's accuracy rate also improved for the general and large target scales, by approximately 20%.
The experiments further verify the accuracy of the proposed partitioning algorithm for image processing results. The variation of classification accuracy and the number of image sequences under different signal-to-noise ratios (SNR) is shown in Fig. 8. As the SNR increases, the maximum number of iterations increases continuously. When the SNR is 40 dB, 50 dB and 60 dB, the corresponding maximum numbers of iterations are 30, 40 and 65 respectively. From the figure, we can calculate the minimum magnitude difference needed to complete the signal classification in the distribution network fault data, and as the SNR increases, the maximum resolution capacity of the algorithm also shows an increasing trend.
In response to the problems of high space occupation and high power consumption in advanced image processing on the FPGA platform, the study proposes an FPGA memory partitioning algorithm for image processing technology. The lower power consumption of the balancing algorithm and the optimised utilisation algorithm lies mainly in the BRAM power consumption and the dynamic power consumption, with a reduction of about 85% in BRAM power consumption and about 45% in dynamic power consumption in both cases. Compared with the optimised utilisation algorithm, the balancing algorithm consumes less power at different sizes, with a dynamic power reduction of approximately 9% and a BRAM power reduction of approximately 43%; compared with the HLS tool, the balancing algorithm consumes less power at different sizes, with a dynamic power reduction of approximately 73% and a BRAM power reduction of approximately 94%. The best frame image pixel values for the partitioning algorithm at the general, small and minimum target scales are 286 dpi, 258 dpi and 246 dpi, and the accuracy rates obtained are 99.1%, 86.7% and 88.6% in that order. When the SNR is 40 dB, 50 dB and 60 dB, the corresponding maximum number of iterations of the partitioning algorithm is 30, 40 and 65 respectively, and the BRAM configuration and memory power model have strong advantages over other storage optimisation configurations. The proposed partitioning algorithm can be applied to the entire FPGA family, which is conducive to the later expansion and development of the FPGA platform. For some special scenes, however, the subjective evaluation of the processing effect is not yet ideal. Future research needs to apply the proposed partitioning algorithm to image tracking and target tracking to enhance its application value.
Footnotes
Funding
The research is supported by: the Hunan Provincial Natural Science Foundation (Grant No. 2020JJ4322).
