Abstract
A hybrid architecture for transforms such as N-point Discrete Fourier Transform (DFT), Discrete Cosine Transform (DCT), Discrete Sine Transform (DST) and Discrete Wavelet Transform (DWT) has been proposed and implemented using triple matrix product method. There is limited work reported on efficient single architecture that can perform multiple transforms simultaneously or serially depending on the application requirement. The Hybrid architecture implemented, can compute various transforms efficiently. A controller is designed which can perform different transforms using the hybrid architecture based on the input provided. The implemented systolic array can be used for computing the diagonal elements of triple-matrix product. The designed architecture produces the output of transform sequence in order, which avoids reordering at output. The implemented architecture can be used to handle large sized transforms by repeatedly using fixed size architecture for a large number of points without increasing the number of Processing Elements (PEs). The proposed architecture has been validated with a watermarking algorithm that uses DCT and DWT transforms and its performance analyzed. The proposed hybrid architecture is implemented on Spartan-7 xc7s100fgga676-1. The simulation results are given and analyzed against standalone architecture.
Introduction
Signal Processing Transforms are used to transform a signal from one domain to another domain. Some of the widely used transforms in signal processing are Discrete Fourier Transform (DFT), Discrete Sine Transform (DST), Discrete Cosine Transform (DCT), Discrete Wavelet Transform (DWT), Fractional Fourier Transform (Fr FT), Discrete Hartley Transform (DHT), Hilbert Transform, Hadamard Transform, Hough Transform, Discrete Walsh Transform etc...
Systolic array is a form of pipe-lining, sometimes in more than one dimension. Systolic computing is a set of simple processing elements with regular and local connections which takes external inputs and processes them in a predetermined manner in a pipe-lined fashion. The first systolic structure was developed by Kung [1] for DFT. After that, many designs using systolic architectures were proposed. An early paper by Aravena explains the parallel representation of linear algorithms, which converts the matrix-vector operations to sum of small size triple-matrix products [2]. In the paper [4], a novel hybrid architecture to compute 8-point 1-D transforms such as DCT, DFT and HWT is proposed. The hardware is implemented based on Matrix decomposition algorithm i.e, Matrix Decomposition of the 8x8 DCT, DFT and HWT. The transforms are proposed for 1-D, 8x8 and forward. In the paper [3], a unified VLSI implementation of the DCT /DST /IDCT/IDST is proposed. The hardware implementation is done based on cyclic convolution structures of transforms. The transforms are proposed only for DCT /DST /IDCT/IDST.
Though there are unifies architectures reported for few multiple transforms, there is no efficient single architecture that can perform multiple transforms simultaneously or serially depending on the application requirement. The objective is to design and implement a scalable architecture for multiple transforms like DFT, DCT, DST, DHT and DWT.
In [4], 2D systolic implementation of DFT is proposed. In paper [5], the DFT algorithm proposed in [4] is rewritten to form a Triple-Matrix-Product form to map it to 2D systolic array. In this paper, the same is extended to multiple transforms such as DCT, DST and DWT and applied to a watermarking application.
Rest of this report is organized as follows. Section 2 describes the matrix-based representation of the algorithms. Hybrid multi-transform architecture that can compute different transforms is described in Section 3. Section 4 contains Simulation results and Conclusions are given in Section 5.
Matrix based representation of transforms
The conversion of DFT algorithm into a triple-matrix product representation and mapping on to a 2D systolic array structure is explained in [5]. In this work, the concept is extended to DCT, DST and DWT.
Triple matrix product form of 1-D DCT
Discrete Cosine Transform (DCT) of an N point sequence x(n) is given by,
Discrete Sine Transform (DST) of an N point sequence x(n) is given by,
Based on the size of the input sequence, it is possible to represent the wavelet coefficients in the form of a NxN matrix. In such cases, the output of the wavelet transform will be the matrix multiplication of a NxN matrix with a Nx1 column vector resulting in a Nx1 column vector. The matrix form can be expressed as below:
The 2-D DWT performs the transformation of a 2-D input signal of size NxN into a transformed domain. This is carried out in two stages, first, by multiplying NxN input matrix with a NxN DWT coefficient matrix and in the second stage the result is multiplied with the transpose of the coefficient matrix. Essentially, it gives a triple matrix product form as shown below:
The proposed hybrid architecture has two types of PEs A and B as in [5]. The systolic arrays A1X and A2X has PEs A and the systolic array A1X_B1, A1X_B2, A2X_B1 and A2X_B2 has PEs B as shown in Fig. 1. PEs A are used to compute double matrix product. The results of PEs A are multiplied with PEs B to obtain triple matrix product.

Proposed hybrid architecture.
Let the triple-matrix-product be C=A.X.B. The matrices A and B are co-efficient matrices and X is input matrix. The input data matrix X has real elements, A and B matrix will have both real and complex values as elements. Hence,
All the inputs are sent in a sequential manner. A real , A imag , B real and B imag are sent to arrays A1_COFF, A2_COFF, B1_COFF and B2_COFF respectively as shown in Fig. 1. Output of the systolic arrays A1X_B1, A1X_B2, A2X_B1 and A2X_B2 are A real * X * B real , A real * X * B imag , A imag * X * B real and A imag * X * B imag respectively. Outputs of systolic arrays A1X_B2 and A2X_B1 are added to get either imaginary part of DFT or DST output based on the input given to the controller. Similarly, outputs of systolic arrays A1X_B1 and A2X_B2 are subtracted to get either real part of DFT or DCT output based on the input given to the controller. The DWT outputs can be directly obtained at the output of systolic arrays A1X_B2 and A2X_B1.
As per (4), the DCT computation requires two sets of A.X.B arrays and the result needs to be subtracted. Hence it is clear that the real part of the DxT gives the DCT output if the DCT coefficients are fed.
As per (8), the DST computation also requires two sets of A.X.B arrays and adding the outputs of individual arrays, it may be observed that the output is similar to that of the imaginary part of DxT.
As per (4), the 1-D DWT computation requires only one set of A.X. array for computation by feeding. Hence the other set of AX array available in Unified architecture can be used for computing another set of 1-D DWT or it can also be used to compute higher order DWT.
Mapping of 2-D DWT
The 2-D DWT requires the complete computation of triple matrix product unlike the computation of diagonal elements as in case of sinusoidal transforms. Hence the output of the first array is fed as the input to the second A.X array.
A 2x1 multiplexer chooses the inputs either from the external channel (select=0, and is the case for DFT, DCT, DST, 1-D DWT) or from the output of the first array (select=1 for 2-D DWT) based on the choice of the transform.
Simulation results
The architecture is implemented for N=16, which is square of the integer 4. The systolic arrays A1X and A2X has 4*4 array of PEs A and the systolic arrays A1X_B1, A1X_B2, A2X_B1 and A2X_B2 has 4*1 array of PEs B.
Hardware implementation
The hybrid structure and standalone structures are implemented on FPGA device using vivado. They are shown in Figs. 1, 2, and 3 respectively. The standalone architecture can calculate DCT and DWT simultaneously. Table 1 shows the hardware resource utilization report generated through vivado during Run Implementation stage. The target device used is Spartan-7 xc7s100fgga676-1. The application requires calculation of DCT and DWT, which is done serially on hybrid architecture. The hybrid architecture active usage is equivalent to Isolated DCT architecture, because it doesn’t utilize the systolic arrays A1X_B2 and A2X_B2. Therefore, hybrid active used resources are less than combined standalone structures, which can be inferred from Table 2.

Isolated DCT architecture.

Isolated DWT architecture.
Resource utilization
Resource utilization
The watermarking algorithm proposed in [6] is implemented for N=64 i.e, each frame will have 64 samples of data. Two level DWT is performed on the frame in which 16 values remain. These values of two frames are tabulated in Table 5. Mean of the 16 DCT output values for two adjacent frames is calculated which is used to obtain watermark key by comparing with adjacent frame values. The algorithm is implemented on proposed Hybrid architecture and a Standalone architecture which can compute DWT and DCT at the same time. The results obtained are used to extrapolate for N=1600.
Hybrid architecture has structural dependency, i.e, it cannot execute DWT and DCT at the same time, even if there is no data dependency. Architecture with standalone DWT and standalone DCT doesn’t have structural dependency. i.e, it can run DWT and DCT at the same time, if there is no data dependency between currently running DCT and DWT. In the watermarking algorithm, DCT is computed after two-level DWT computation, i.e, there is data dependency between DCT and DWT.
The DFT implementation requires 2N + 4N2 multipliers, 2N + 4N2 + 2 adders. For DCT and DST implementation, it requires 2N + 2N2 + 1 adders, 2N + 2N2 multipliers. For DWT implementation, it requires M2 adders, M2 multipliers, where M=
Once the architecture computes first set of DWT values, the standalone architecture can start computing first set of DCT and second set of DWT at the same time, whereas the hybrid architecture cannot. It can either compute 1st set of DCT or second set of DWT values. In this case, the second set of DWT is computed first and then DCT is computed as shown in Table 3. It can be observed that, time saved for two frames by standalone architecture is 270 time units which is the time required to run one DCT. Time saved for n frames is (n-1)*270. For N=1600, no. of DWT runs to be computed is 200 and no. of DCT runs to be computed is 25.
Comparison between hybrid and standalone
Comparison between hybrid and standalone
The data is extrapolated for 1025 frames with each frame having 1600 samples. As per Table 4, Hybrid architecture needs 35.47% more time. As DCT and DWT computations are done serially on Hybrid Architecture, more time is required when compared to standalone architecture which can compute DCT and DWT parallelly.
Time comparison for N=1600
The algorithm is simulated for two frames containing 64 samples each as shown in Fig. 4 and the level-2 DWT and mean values of DCT are captured and compared. The average error for outputs of 2 level DWT outputs shown in Table 5 is 0.012785% and for DCT outputs shown in Table 6 is 0.0395%.

Algorithm.
DWT outputs (round off to 6 digits)
DCT outputs (round off to 6 digits)
A hybrid architecture for multiple transforms has been proposed. Implemented Hybrid architecture is verified against the values obtained in Matlab. A watermarking application which uses DWT and DCT is identified. The algorithm sequence is implemented in verilog.
The proposed hybrid architecture occupies less area and has lesser resource utilization. The designed architecture produces the output of transform sequence in order, which avoids reordering the output.The implemented architecture can be used to handle large sized transforms by repeatedly using fixed size architecture for a large number of points without increasing the number of Processing Elements (PEs). Hybrid architecture needs 35.47% more time. There is 22.14%, 16.83%, 9.09% and 23.21% reduction in LUT’s, FF’s, DSP units and IOB’s respectively in the hybrid architecture.
As future work, we want to modify the architecture so that an application can make use of maximum resource in parallel to have an higher yield.
Footnotes
Acknowledgment
We would like to express our deepest gratitude to our collaborator Dr. Mamatha.I from Amrita University, whose constructive suggestions helped to speed up the implementation.
