Abstract
Hand gesture recognition, one of the most popular research topics in human–machine interaction, is extensively used in virtual and augmented reality, sign language translation, prosthesis control, and so on. To improve the flexibility and interactivity of wearable gesture sensing interfaces, flexible electronic systems for gesture recognition have been widely studied. However, these systems are limited in terms of wearability, stability, scalability, and robustness. Herein, we report a flexible wearable hand gesture recognition system that is based on an iontronic capacitive pressure sensing array and deep convolutional neural networks. The entire capacitive array is integrated into a flexible silicone wristband and can be comfortably and conveniently wrapped around the wrist. The pressure sensing array, which is composed of an iontronic film sandwiched between two flexible screen-printed electrode arrays, exhibits a high sensitivity (775.8 kPa−1), fast response time (65 ms), and high durability (over 6000 cycles). Image processing techniques and deep convolutional neural networks are applied for sensor signal feature extraction and hand gesture recognition. Four contexts are evaluated: intertrial test (average accuracy of 99.9%), intersession rewearing (average accuracy of 93.2%), electrode shift (average accuracy of 83.2%), and different arm positions during measurement (average accuracy of 93.1%).
Introduction
The human–machine interface has long been an active research area. Recently, with the rapid development of flexible electronics and tactile sensing, it has become more feasible to apply these techniques to health monitoring, medical treatments, control interfaces, intelligent robotics, and prosthetics.1–3 Promoted by the development of artificial intelligence (AI), the combination of advanced sensing methods and AI offers the opportunity to manufacture soft robots4 and flexible human–machine interfaces5,6 with greater perception ability. Hand gesture recognition is an important component in the loop of human–machine interaction and is widely used in various applications.7–10 Before flexible sensors came into wide use, inertial measurement units,11,12 surface electromyography (sEMG),13–15 and commercial pressure/flex sensors16,17 were commonly used as wearable devices for collecting human motion signals. However, these devices are rigid and inconvenient to wear, and therefore may hinder normal hand motion and obstruct prolonged wearing for all-day motion monitoring.
Recently, flexible and stretchable sensors have been applied to the task of hand gesture recognition. The sensing principles of these sensors mainly include resistive,18 capacitive,19,20 piezoelectric,21 and triboelectric,22–24 and the sensors are designed to measure either pressure or strain at different parts of the hand when gestures are performed. Moreover, flexible high-density sEMG sensors have also been developed to acquire electrophysiological signals for movement information.25 The placement locations of these flexible sensors include the fingers,18,23,26 the back of the hand,27 and the forearm.19,25 For long-term daily wear of a flexible system, the wrist may be a better choice for improving the wearability of the sensors,24 since a wrist-worn device neither obstructs the movements of the fingers nor impedes the motions of the forearm. In addition, most of the tendon and muscle groups that are in charge of hand movements run underneath the skin of the wrist,28 which makes it possible to detect skin and tendon deformation using flexible sensors.
Several studies have explored collecting gesture information through wrist-mounted devices.16,24,28–31 However, these sensors either are rigid or have only a few sensing channels, and are therefore uncomfortable to wear or fail to collect abundant information around the wrist. Meanwhile, although flexible pressure sensor arrays with a large number of sensing channels have been extensively studied,32–34 they have rarely been explored as wrist-mounted gesture recognition devices. The obstacles to using a wrist-mounted flexible pressure sensor array include: (1) Skin deformation around the wrist is subtle and difficult to capture. (2) When sensors are manufactured into a sensing array, crosstalk among channels is hard to eliminate.35 (3) Conventional machine learning methods require handcrafted feature engineering and thus are not well suited to multichannel sensor data.36 Hence, sensors with high sensitivity and high signal-to-noise ratio, as well as advanced data processing techniques, are required.
Over the past decade, iontronic sensing has emerged as a new sensing modality.37 Ultrahigh sensitivity is achieved through the supercapacitive formation of an electric double layer38 at the interface between the ionic gel and the electrode. To further improve the performance of iontronic capacitive pressure sensors, microstructures39–46 are introduced onto the surface of the iontronic film or electrodes. This high sensitivity significantly improves the immunity of the sensors to noise.39 Therefore, when sensors are assembled into arrays with high sensing element density, crosstalk among channels can be greatly inhibited. Moreover, deep neural networks have been introduced to process time-series data47,48 and array-like data34,49,50 from wearable sensors, providing a powerful tool for complex feature embedding and classification.
In this article, a flexible wearable hand gesture recognition system that combines a highly sensitive iontronic capacitive array with deep convolutional neural networks is reported. The capacitive array has 4 × 8 sensing channels that map the instantaneous pressure produced by deformations around the wrist into capacitance values (Fig. 1). The array signals are measured by a customized readout circuit, then reshaped and interpolated into pressure images, and fed into deep neural networks for feature extraction. Seven hand gestures are selected to evaluate our system, and a residual network-based architecture51 is implemented as a feature extractor and classifier. Our flexible wristband combined with deep neural networks remains accurate across intertrial tests, arm position changes, and intersession rewearing, which makes it promising for prolonged wearing and robust to donning and doffing in potential applications such as remote control, prosthetics, and virtual/augmented reality.

The flexible wristband serves as a system to monitor hand gestures. The flexible wristband has a multilayer structure, which contains an iontronic capacitive array with 32 sensing channels, a flexible silicone substrate, a touch bump array, and a PDMS spacer grid. It is placed on the palmar side of the wrist to capture skin and tendon deformation when the gestures are performed. The output signal is collected by a customized readout circuit, and the pressure mappings are reshaped into images, which are interpolated and fed into deep neural networks for feature embedding. Seven hand gestures are selected as the gesture set, and recognition results of intertrial performance, arm position change, electrode shift, and intersession rewearing are analyzed. PDMS, polydimethylsiloxane.
Materials and Methods
The entire system is composed of two polyethylene terephthalate (PET) substrates, each screen printed with a 4 × 8 electrode array, an iontronic film, a polydimethylsiloxane (PDMS) spacer grid, a silicone touch bump array, a flexible silicone wristband substrate, and a readout circuit. The PDMS spacer grid prevents the iontronic film and electrodes from adhering to each other under high initial pressure after a gesture is finished, and the touch bump array concentrates pressure onto each sensing element.
Preparation of the electrode array
The fabrication process is shown in Figure 2A. The 4 × 8 electrode array consists of 32 circular electrodes, each with a diameter of 4 mm. The spacing between each row is 7 mm, and the spacing between each column is 10 mm. The electrode pattern was customized into a template and screen printed with silver conductive ink (CI-1036; ECM) onto a 35 μm-thick PET substrate. Then the silver electrodes were cured in an oven at 130°C for 30 min.
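The layout dimensions above can be checked with a short sketch that enumerates the electrode centers. It assumes that the stated 7 mm and 10 mm spacings are center-to-center pitches (the text does not say whether they are pitches or edge-to-edge gaps); the function name is illustrative only.

```python
def electrode_centers(rows=4, cols=8, row_pitch_mm=7.0, col_pitch_mm=10.0):
    """Return (x, y) electrode centers in mm, row-major, first electrode at (0, 0).

    Assumption: the stated spacings are center-to-center pitches.
    """
    return [(c * col_pitch_mm, r * row_pitch_mm)
            for r in range(rows) for c in range(cols)]


centers = electrode_centers()
diameter_mm = 4.0  # electrode diameter from the text

# Outer footprint of the printed array under the pitch assumption.
width_mm = max(x for x, _ in centers) + diameter_mm
height_mm = max(y for _, y in centers) + diameter_mm
print(len(centers), width_mm, height_mm)
```

Under this assumption the 32 electrodes span roughly 74 mm by 25 mm.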

Wristband fabrication process and signal acquisition system overview.
Preparation of the iontronic film
The iontronic film is microstructured with graded intrafillable pillars and grooves templated from sandpaper, as described previously,39 to allow for high sensitivity and a broad pressure sensing range. To prepare the polyvinyl alcohol (PVA)/H3PO4 solution, 4 g PVA (Mw ∼145,000; Aladdin Industrial Corporation) was added to 36 g deionized water and stirred for 2 h at 95°C. After the PVA was fully dissolved and the solution had cooled down to room temperature (27°C), 3.3 mL H3PO4 (AR ≥85%; Shanghai Macklin Biochemical Co., Ltd.) was dropped into the PVA/water solution and stirred for another 2 h. After thorough mixing, the PVA/H3PO4 solution was poured onto precleaned sandpaper (Grit: 10000#) and cast using a film casting coater (MSK-AFA-HC100; Shenzhen Kejing Star Technology Company) with a height-adjustable applicator (KTQ-80F; Shenzhen Kejing Star Technology Company) set to a height of 500 μm. Finally, after curing at room temperature for 24 h, the cured iontronic film was carefully peeled off from the sandpaper.
Fabrication of the touch bumps and wristband substrate
The touch bumps and wristband substrate were fabricated by mixing a silicone elastomer (Hong Ye Jie Technology Co. LTD.) with the curing agent (1:1 weight ratio) under mechanical stirring, and then the mixture was degassed in a vacuum chamber to remove the entrapped air. A small amount of silicone color pigments (PMS 2757C; Smooth-On, Inc.) was added to check the mixture state during stirring. The touch bumps were arranged into a 4 × 8 array, and each was in the shape of a cylinder, which has a diameter of 4 mm and a height of 1.9 mm. To fully and precisely cover all electrodes in the array, the spacing between each row of the touch bump array is 7 mm and the spacing between each column is 10 mm. The thickness of the wristband substrate is 1.4 mm, and Velcro holes were reserved for the placement of Velcro. To manufacture both touch bumps and wristband substrate, the mixed silicone was poured into the acrylic molds and cured at 70°C for 15 min, and then the molds were removed to obtain the templated touch bumps and wristband substrate.
Assembly of the wristband
To enhance the recoverability of individual sensing elements under large initial pressure, PDMS spacers were placed between the top electrode array and the iontronic film as support structures. A 50 μm-thick PDMS sheet was cut into stripes, and those stripes were carefully arranged and stuck onto the PET substrate of the top electrode array in the form of a grid, which only left out the area of the electrode. Then the iontronic film was sandwiched between the top electrode array and the bottom electrode array, and this capacitive sensing array was sandwiched between the wristband substrate and the touch bump array. Finally, Velcro straps were fixed through the Velcro holes.
Signal acquisition system description
The signal acquisition system (Fig. 2B) consists of eight voltage followers for voltage stabilization and an eight-channel analog switch for multiplexing. Each output of the analog switch is connected to a capacitance-voltage converter, which converts the capacitance signal to a voltage signal. The output voltage is then sampled by a 12-bit analog-to-digital converter module embedded in the microcontroller (STM32F303CBT6). The collected voltage signals are immediately average filtered in the microcontroller, and the filtered signals are transmitted to the laptop through a universal synchronous/asynchronous receiver/transmitter. The sampling rate of the system is 50 Hz. The capacitive array was bonded to a flexible printed circuit (FPC) with anisotropic conductive film (AC-7813KM; Hitachi) and then connected to the signal acquisition system through an FPC connector.
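The conversion from ADC codes to voltages and the average filtering step can be illustrated as follows. This is a host-side sketch of the logic only, not the microcontroller firmware: the 3.3 V reference is an assumption, and the filter length WINDOW is hypothetical because the text does not state it.

```python
from collections import deque

ADC_BITS = 12   # from the text: 12-bit converter
V_REF = 3.3     # assumption: 3.3 V full-scale reference
WINDOW = 4      # hypothetical filter length; not given in the text


def counts_to_volts(counts):
    """Convert a raw ADC code (0..4095) to volts."""
    return counts * V_REF / (2 ** ADC_BITS - 1)


class MovingAverage:
    """Average filter over the most recent samples of one channel."""

    def __init__(self, window=WINDOW):
        self.buf = deque(maxlen=window)

    def update(self, value):
        self.buf.append(value)
        return sum(self.buf) / len(self.buf)


filt = MovingAverage()
raw_codes = [2048, 2052, 2044, 2048]  # made-up samples for illustration
filtered = [filt.update(counts_to_volts(c)) for c in raw_codes]
```

One filter instance would be kept per sensing channel, so a 4 × 8 array needs 32 of them.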
Data acquisition procedure
During the experiments, the subject sat in a chair with the arm naturally placed on the table. The wristband was fixed on the right wrist of the subject using a tightness-adjustable Velcro strap, and the center of the capacitive array was roughly aligned with the palmaris longus. Throughout the experiments, a trial refers to performing all seven gestures in sequential order, with a 5-s rest period between gestures and each gesture repeated five times. A session refers to conducting several trials during a single wearing without doffing the wristband. To evaluate the performance of the wristband, four contexts were considered: intertrial performance, the impact of arm position change during data collection, the impact of electrode shift, and intersession performance.
In the intertrial experiments, three trials were conducted. In each trial, each gesture was performed for 5 s, and the time interval between trials was 10–15 s.
In the electrode shift experiments, the data were collected when the wristband was placed in the original position. One trial was conducted, and each gesture was performed for 15 s. After the data on the original position were collected, the wristband was removed and redonned, and a 5 mm electrode shift in the distal/proximal/ulnar/radial direction was introduced, respectively (Fig. 3A). In these shifted positions, one trial was performed, and each gesture lasted for 3 s.
In the arm position experiments, the data were collected under three different arm positions as shown in Figure 3B. In each position, one trial was performed, and each gesture lasted for 5 s in each trial.
In the intersession experiments, four sessions were conducted within a day. In the first session, one trial was included, and each gesture was performed for 15 s. Between sessions, the wristband was removed and the subject rested for at least 1 h. In the other three sessions, one trial was included, and each gesture was performed for 5 s.

Experimental procedures, data preprocessing framework, and network architectures.
The experiments on human subjects were approved by the ethical committee of Peking University (protocol number: 20180602).
Data preprocessing
The data preprocessing framework is shown in Figure 3C. The output of each sensing channel in the capacitive array was individually calibrated. The voltage value V0 of each sensing channel, averaged over the 5-s rest period before each gesture begins, was used as the initial pressure value after the wristband was placed on the wrist. This initial pressure value was removed when the gestures were performed: Vcalibrated = Vraw − V0. The calibrated sensor data were reshaped to a 4 × 8 array and then interpolated using the bicubic interpolation method, forming a 32 × 64 pressure mapping image. The expanded array provides a rough estimate of a more fine-grained pressure distribution and gives the input a valid size for the deep neural network. Then the value of each pixel in this image was min-max normalized from 0–3.3 V to 0–1. After the data preprocessing procedure, each frame of data can be treated as a grayscale image.
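The preprocessing chain (baseline removal, reshaping, bicubic upsampling by a factor of 8, and normalization) can be sketched without external libraries as below. Several details are assumptions rather than the reported implementation: the Keys cubic convolution kernel with a = −0.5 stands in for whichever bicubic routine was actually used, channels are taken to be in row-major order, and values falling outside 0–1 after normalization are clipped.

```python
import math


def cubic_kernel(t, a=-0.5):
    """Keys cubic convolution kernel (one common bicubic definition)."""
    t = abs(t)
    if t <= 1:
        return (a + 2) * t ** 3 - (a + 3) * t ** 2 + 1
    if t < 2:
        return a * t ** 3 - 5 * a * t ** 2 + 8 * a * t - 4 * a
    return 0.0


def resize_1d(values, out_len):
    """Cubic resampling of a 1D list; edges are handled by clamping indices."""
    n = len(values)
    scale = n / out_len
    out = []
    for i in range(out_len):
        x = (i + 0.5) * scale - 0.5      # source coordinate of output pixel i
        x0 = math.floor(x)
        acc = 0.0
        for k in range(x0 - 1, x0 + 3):  # four neighboring source samples
            idx = min(max(k, 0), n - 1)
            acc += values[idx] * cubic_kernel(x - k)
        out.append(acc)
    return out


def bicubic_resize(img, out_rows, out_cols):
    """Separable resize: along rows first, then along columns."""
    rows = [resize_1d(r, out_cols) for r in img]
    cols = [resize_1d(list(c), out_rows) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def preprocess(v_raw, v_rest, v_min=0.0, v_max=3.3):
    """v_raw: 32 channel voltages of one frame; v_rest: per-channel rest mean V0."""
    calibrated = [r - b for r, b in zip(v_raw, v_rest)]          # Vraw - V0
    grid = [calibrated[i * 8:(i + 1) * 8] for i in range(4)]     # 4 x 8, row-major (assumed)
    image = bicubic_resize(grid, 32, 64)
    # Min-max normalization from 0-3.3 V to 0-1; clipping is an added assumption.
    return [[min(max((p - v_min) / (v_max - v_min), 0.0), 1.0) for p in row]
            for row in image]


frame = preprocess([1.0] * 32, [0.34] * 32)  # made-up uniform frame
```

A spatially uniform frame stays uniform after interpolation, which gives a simple sanity check on the kernel weights.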
Model architecture: residual network-based model
The model architecture is based on a residual neural network, which is a classical model in computer vision. The essence of this architecture is introducing shortcuts between convolutional layers, which makes it possible to build a deeper network without sacrificing training accuracy.
The entire network architecture was modified from the original residual network and is shown in Figure 3D. There were 16 convolutional layers in this network. The basic block was composed of two convolutional layers with the kernel size of 3 × 3 and kernel number of N (in our model, N equals 8, 16, or 32), and each was followed by a batch normalization layer and a rectified linear unit activation layer. The output of the basic block was connected to the output of the next basic block with a shortcut. Max pooling (2 × 2) layers were used three times in the network to downsize the feature image, and each downsizing was followed by increasing the depth of the feature image. The dimension mismatch was solved by performing linear projection (1 × 1 convolution). The output of the last convolutional layer was directly fed into a global average pooling layer, and then a softmax layer was used for classification. Spatial dropout (with a dropout rate of 0.5) and L2 regularizer (with a regularization factor of 0.001) were used to prevent the model from overfitting.
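The spatial bookkeeping of this architecture can be traced with a few lines of code. The sketch assumes that the 3 × 3 convolutions are padded so that only the pooling layers change the feature map size, which the text does not state explicitly; the helper names are illustrative.

```python
def trace_shapes(height=32, width=64, n_pools=3, pool=2):
    """Feature map size after each 2 x 2 max pooling step.

    Assumption: convolutions preserve the spatial size ('same' padding).
    """
    shapes = [(height, width)]
    for _ in range(n_pools):
        height, width = height // pool, width // pool
        shapes.append((height, width))
    return shapes


def global_average_pool(feature_maps):
    """Reduce each channel (a 2D list) to its mean value."""
    return [sum(map(sum, fm)) / (len(fm) * len(fm[0])) for fm in feature_maps]


shapes = trace_shapes()
print(shapes)
```

With a 32 × 64 input, three poolings bring the feature map down to 4 × 8, the same grid size as the physical sensing array, and global average pooling then yields one value per kernel for the classifier.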
Training process
To enrich the dataset with greater variety, we implemented data augmentation52 before training. The augmentation pipeline consists of: shift, scale, rotate, Gauss noise, multiplicative noise, blur, Gaussian blur, median blur, change of brightness and contrast, and coarse dropout. In data augmentation, noise and blurs were used to mimic the uncertainty of sensing channels. Shifting, scaling, and rotation simulated wristband rotation and shifting on the wrist. Change of brightness and contrast simulated different contact forces between skin and electrodes when gestures were performed in different trials and sessions. Coarse dropout prevents the model from being overly dependent on a certain region of the pressure mapping.
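Three of the listed transforms (shift, Gauss noise, and coarse dropout) are illustrated below in a dependency-free form. This is a sketch of the idea rather than the pipeline that was used: all magnitudes (shift range, noise level, hole size) are hypothetical, and the remaining transforms are omitted.

```python
import random


def shift_image(img, dy, dx, fill=0.0):
    """Translate a 2D list by (dy, dx) pixels, padding exposed pixels with `fill`."""
    h, w = len(img), len(img[0])
    return [[img[r - dy][c - dx] if 0 <= r - dy < h and 0 <= c - dx < w else fill
             for c in range(w)] for r in range(h)]


def add_gauss_noise(img, sigma, rng):
    """Add zero-mean Gaussian noise and clip to the 0-1 grayscale range."""
    return [[min(max(p + rng.gauss(0.0, sigma), 0.0), 1.0) for p in row]
            for row in img]


def coarse_dropout(img, hole_h, hole_w, rng, fill=0.0):
    """Blank out one randomly placed rectangular region."""
    h, w = len(img), len(img[0])
    top = rng.randrange(h - hole_h + 1)
    left = rng.randrange(w - hole_w + 1)
    out = [row[:] for row in img]
    for r in range(top, top + hole_h):
        for c in range(left, left + hole_w):
            out[r][c] = fill
    return out


def augment(img, rng):
    """Apply the three transforms in sequence with hypothetical magnitudes."""
    img = shift_image(img, rng.randint(-2, 2), rng.randint(-4, 4))
    img = add_gauss_noise(img, sigma=0.02, rng=rng)
    return coarse_dropout(img, hole_h=4, hole_w=8, rng=rng)


rng = random.Random(0)
sample = [[0.5] * 64 for _ in range(32)]  # made-up uniform pressure image
augmented = augment(sample, rng)
```

Passing an explicit random generator keeps the augmentation reproducible, which helps when comparing training runs.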
For the residual network, Adam 53 was used as the optimizer. The number of training epochs was 50, and the pressure mapping images were augmented and fed into the network with a batch size of 128. The initial learning rate was 0.001 and shrank by a factor of 0.5 every 15 epochs.
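The learning rate schedule described above is a step decay, which can be written out explicitly. The sketch assumes that epochs are counted from zero and that the halving takes effect at epochs 15, 30, and 45; the function name is illustrative.

```python
def learning_rate(epoch, initial=1e-3, factor=0.5, step=15):
    """Step decay: multiply the rate by `factor` every `step` epochs (epoch starts at 0)."""
    return initial * factor ** (epoch // step)


schedule = [learning_rate(e) for e in range(50)]
```

Under these assumptions the rate is 0.001 for epochs 0–14, 0.0005 for epochs 15–29, 0.00025 for epochs 30–44, and 0.000125 for the last five epochs.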
Results and Discussions
Sensing properties
The sensitivity of the capacitive pressure sensor is defined as S = δ(ΔC/C0)/δP, where C0 is the initial capacitance without external pressure, ΔC is the change of capacitance after the pressure is applied, and P is the applied pressure.
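In practice, S is the slope of the relative capacitance change plotted against pressure over a given pressure range. The sketch below estimates that slope with an ordinary least-squares fit; the fitting method is an assumption (the text does not state how the slopes were extracted), and the data are synthetic rather than measured.

```python
def sensitivity(pressures_kpa, capacitances, c0):
    """Least-squares slope of (C - C0) / C0 versus P, in kPa^-1."""
    rel = [(c - c0) / c0 for c in capacitances]
    n = len(pressures_kpa)
    mean_p = sum(pressures_kpa) / n
    mean_r = sum(rel) / n
    num = sum((p - mean_p) * (r - mean_r) for p, r in zip(pressures_kpa, rel))
    den = sum((p - mean_p) ** 2 for p in pressures_kpa)
    return num / den


# Synthetic, perfectly linear data (not measurements): C = C0 * (1 + 2 * P).
c0 = 10.0
p = [0.0, 1.0, 2.0, 3.0]
c = [c0 * (1 + 2.0 * x) for x in p]
s = sensitivity(p, c, c0)
```

Applying the same fit separately to each pressure range would give one sensitivity value per range.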
The sensitivity is obtained from the pressure-capacitance curve shown in Figure 4A. The sensor exhibits a sensitivity of 775.8 kPa−1 in the pressure range of 0–4.5 kPa, 179.5 kPa−1 in the pressure range of 4.5–45.5 kPa, and 31.5 kPa−1 in the pressure range of 45.5–190 kPa. Because of the low sampling speed of the inductance, capacitance, and resistance (LCR) meter (E4980AL; Keysight), the response and recovery times of the sensor were measured with a customized capacitance-to-voltage readout circuit at a sampling rate of 1000 Hz. When a pressure of 50 Pa is loaded and unloaded, the sensor exhibits response and recovery times of 65 ms and 59 ms, respectively (Fig. 4B), which is sufficient for hand gesture recognition. Figure 4C shows the sensor performance when loading and unloading weights of 2, 5, 10, and 20 g with an initial weight of 105 g (∼2 kPa) placed on the sensor, which simulates the detection of skin deformation when the wristband is worn on the wrist with an initial pressure. The stability of the sensor was characterized by loading and unloading a pressure of 175 kPa onto the sensor over 6000 times. As shown in Figure 4D, the output of the sensor shows no obvious drift.

Characterization of the sensor.
On-wrist pressure analysis
A customized capacitance-to-voltage circuit (Fig. 5A) was used in sensing array signal collection, which introduced nonlinearity to the relationship between input capacitance C and output voltage Vout. Therefore, the circuit output Vout cannot be directly mapped to pressure P applied on the sensor based on the P-C relationship in Figure 4A.

To analyze the pressure applied to sensing points on the wristband, we measured the relationship between pressure and voltage directly as shown in Figure 5B. The P-V curve exhibits three stages of sensitivity: 0.022 V/kPa in the pressure range of 0–67 kPa, 0.008 V/kPa in the pressure range of 67–165 kPa, and 0.003 V/kPa in the pressure range of 165–245 kPa. We also analyzed the sensor array output when the hand and wrist were at rest (V0) and when gestures with the largest wrist movement range were performed (wrist extension), as shown in Figure 5C. Each channel of the array was averaged across one experiment trial, and the initial output voltage without applying any pressure is 0.14 V.
As shown in Figure 5C, the maximum output change relative to the initial output voltage (0.14 V) of the sensor array is below 1.45 V. According to Figure 5B, this lies within the first stage of P-V sensitivity (pressure range: 0–67 kPa, output voltage change range: 0–1.45 V). Therefore, in the on-wrist wristband wearing scenario, the input pressure has a linear relationship with the output voltage with a sensitivity of 0.022 V/kPa. This linear mapping between pressure and output voltage also makes it convenient for output signal calibration, where we only focus on the output changes relative to the initial signal value.
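Within this first stage, a channel voltage can be converted to an approximate pressure by dividing the voltage change by the sensitivity. The sketch below uses the values quoted above and assumes a zero-intercept linear fit, so that pressure is proportional to the change from the 0.14 V no-load output; the function name is illustrative.

```python
V_INITIAL = 0.14         # V, output with no applied pressure (from the text)
SENS_V_PER_KPA = 0.022   # V/kPa, first-stage sensitivity (from the text)
MAX_DELTA_V = 1.45       # V, upper end of the first stage (from the text)


def pressure_from_voltage(v_out, v_initial=V_INITIAL):
    """Approximate pressure in kPa, valid only in the first linear stage."""
    delta = v_out - v_initial
    if not 0.0 <= delta <= MAX_DELTA_V:
        raise ValueError("voltage change is outside the first linear stage")
    return delta / SENS_V_PER_KPA


print(pressure_from_voltage(0.36))  # about 10 kPa
```

Rejecting out-of-range inputs avoids silently extrapolating the first-stage slope into the second and third stages, where the sensitivity is lower.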
Gesture pressure mapping visualization
Seven hand and wrist gestures are selected: s/one/two in American Sign Language, wrist flexion, wrist extension, radial deviation, and ulnar deviation. All these gestures can be linked with meaningful control commands for drones, self-driving cars, and so on. The reshaped array of wrist pressure mappings and interpolated pressure images of each gesture are shown in Figure 6A. From the images, we can infer the pressure distribution around the palmar side of the wrist when different gestures are performed. Some of the augmented samples are shown in Figure 6B.

Results of
Principal component analysis (PCA) was used for dimension reduction and visualizing data in a 2D plane. The PCA results of the original pressure mapping and features extracted by the deep neural network are shown in Figure 6C and D. In Figure 6C, some data from different categories are entangled together, whereas after the feature extraction by the neural network, features become separable. This shows the excellent feature embedding ability of the neural network.
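For readers unfamiliar with the projection step, a minimal PCA to two dimensions is sketched below using power iteration with deflation. This is an illustration of the technique only: the analysis in this work presumably relied on a standard library routine, the toy data are made up, and the sketch assumes the data have at least two directions of nonzero variance.

```python
import random


def pca_2d(data, iters=200, seed=0):
    """Project rows of `data` onto their top two principal components."""
    n, d = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(d)]
    x = [[row[j] - mean[j] for j in range(d)] for row in data]  # centered data
    rng = random.Random(seed)
    comps = []
    for _ in range(2):
        v = [rng.gauss(0, 1) for _ in range(d)]
        for _ in range(iters):
            # Deflation: remove the directions that were already found.
            for u in comps:
                dot = sum(a * b for a, b in zip(v, u))
                v = [a - dot * b for a, b in zip(v, u)]
            # Multiply by the (unnormalized) covariance: X^T (X v).
            xv = [sum(a * b for a, b in zip(row, v)) for row in x]
            v = [sum(x[i][j] * xv[i] for i in range(n)) for j in range(d)]
            norm = sum(a * a for a in v) ** 0.5
            v = [a / norm for a in v]
        comps.append(v)
    return [[sum(a * b for a, b in zip(row, u)) for u in comps] for row in x]


# Toy data: largest variance along the first axis, smaller along the second.
points = [[-2.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, -1.0, 0.0]]
proj = pca_2d(points)
```

The sign of each component is arbitrary, so only the magnitudes of the projected coordinates are meaningful when comparing runs.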
Hand gesture recognition result
The hand gesture recognition result is evaluated by intertrial performance, arm position change during data collection, electrode shift, and intersession performance. All the results are shown in Figure 7.

Hand gesture recognition accuracies of
Intertrial performance
In the intertrial experiments, the wristband was not removed from the wrist until the experiment was finished. Leave-two-trial-out validation was used to evaluate the performance. Concretely, trials I, II, and III were each used in turn as the training set, and the two left-out trials were used for validation. Results of each round are shown in Figure 7A. Without taking off the wristband, training on any single trial achieved a validation accuracy near 100%. The confusion matrix of the intertrial experiments can be found in Figure 7E.
Arm position change
In the arm position change experiments, the wristband was not removed from the wrist until the experiment was finished. Leave-two-position-out validation was used to evaluate the performance. Results from each round are shown in Figure 7B. Training on positions I, II, and III in turn yields validation accuracies above 90% in most cases, except for the case in which data collected from position II were used for training and data collected from position I were used for validation (with a validation accuracy of 84.9%). The averaged confusion matrix over these six validation sets is shown in Figure 7E. From the result, we can see that the arm position change during data collection has only a minor impact on the system performance.
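The six validation sets arise from training on one position and validating on each of the other two. The enumeration can be sketched as follows; the function name is illustrative, and treating each left-out group as a separate validation set follows the count of six stated above.

```python
def leave_two_out_splits(groups):
    """Train on one group and validate on each remaining group separately."""
    return [(train, val) for train in groups for val in groups if val != train]


splits = leave_two_out_splits(["I", "II", "III"])
for train, val in splits:
    print(f"train on position {train}, validate on position {val}")
```

The same enumeration applies to the intertrial experiments, with trials in place of arm positions.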
Electrode shift
The electrode shift experiments cover four directions: ulnar, radial, distal, and proximal. For each direction, the wristband was removed and put back on with a 5 mm shift. The data collected from the original position were used for model training. The four-direction shift classification results are shown in Figure 7C. Shifts toward the proximal and distal directions maintain classification accuracies of 90.6% and 87.0%, whereas shifts toward the ulnar and radial directions decrease the classification accuracy to 75.2% and 80.1%. This means that an obvious electrode shift may degrade the wristband performance, particularly along the ulnar-radial axis. The averaged confusion matrix of these four directions is shown in Figure 7E.
Intersession performance
Assuming that no obvious electrode shift occurs under the experimenter's supervision, we next consider the intersession doffing and redonning experiment. Data from four sessions were collected, and the data from the first session were used for training the model. Without any recalibration of the trained model, as shown in Figure 7D, the classification accuracies on sessions I, II, and III are 100.0%, 97.4%, and 82.1%, respectively. This indicates that the performance might decrease as the time interval between the first wearing and the next rewearing session increases. The averaged confusion matrix of these three sessions is shown in Figure 7E.
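As a consistency check, the per-condition accuracies reported for the electrode shift and intersession experiments can be averaged to recover the summary figures quoted in the Abstract. The sketch assumes an unweighted mean over conditions; variable names are illustrative.

```python
def mean_accuracy(values):
    """Unweighted mean of a list of accuracies (in percent)."""
    return sum(values) / len(values)


# Accuracies quoted in the text (percent).
electrode_shift = {"proximal": 90.6, "distal": 87.0, "ulnar": 75.2, "radial": 80.1}
intersession = [100.0, 97.4, 82.1]

shift_avg = round(mean_accuracy(list(electrode_shift.values())), 1)
session_avg = round(mean_accuracy(intersession), 1)
print(shift_avg, session_avg)
```

The results, 83.2% and 93.2%, agree with the average accuracies given in the Abstract for electrode shift and intersession rewearing.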
Conclusions
In this article, we report a flexible wearable hand gesture recognition system that is based on an iontronic capacitive pressure sensor array and deep convolutional neural networks. The system has a multilayer structure that integrates a flexible iontronic capacitive array, a touch bump array, and a soft wristband substrate. The proposed system markedly improves recognition robustness through the combination of multiple sensing channels, the excellent noise immunity of iontronic capacitive sensors, and the strong feature extraction ability of deep convolutional neural networks. Seven hand gestures were selected as a gesture set, and a 16-layer residual network-based architecture was implemented as a feature extractor and classifier. The system shows small performance degradation under the impact of the intertrial test, arm position change, and intersession rewearing, with average recognition accuracies of 99.9%, 93.1%, and 93.2%, respectively. Our system provides new insight into fabricating robust and scalable wearable hand gesture recognition devices for future applications in wearable robotics.
Acknowledgments
The authors appreciate the valuable comments and suggestions from the two anonymous reviewers. The authors also thank Shichang Zhang for circuit manufacturing and Yuwen Lu for advice on the writing details of the article.
Author Disclosure Statement
No competing financial interests exist.
Funding Information
This work was supported, in part, by the National Key Research and Development Program of China under Grant 2018YFE0114700; in part, by the National Natural Science Foundation of China under Grant 91948302, Grant 51922015, and Grant 81972131; and, in part, by the PKU-Baidu Fund under Grant 2020BD008.
