Abstract
Automated driving is an increasing trend, supported both by driver assistance systems that are becoming more widely available and more powerful, and by research in the car manufacturing industry. As a consequence, the driver's hands and feet are less involved in vehicle control, and increasing automation will eventually let them become idle. Recent gesture recognition work mainly focuses on hand interaction. This work focuses on the possibilities of foot gesture interaction.
Many gesture recognition systems rely on computationally intensive video systems that raise privacy concerns. Furthermore, these systems require a line of sight and therefore a visible integration into the interior design. The proposed system shows that invisibly integrated capacitive proximity sensors can do the job, too. They do not cause privacy issues and can be integrated under non-conductive materials, so there is no visible impact on the interior design.
The proposed solution distinguishes between four foot gestures. There is no limitation to feet movement. Further, an evaluation with six participants and a vehicle legroom mockup demonstrates that the system works. This work contributes to the basis of driver foot gesture recognition and points to further applications and more comprehensive investigations.
Introduction
While there is a plenitude of vehicle manufacturers (e.g. Volkswagen [28] or BMW [1]) that track driver hands for multimedia gesture control, there is little research on feet gestures.
Nevertheless, the driver's feet are among the first extremities to become idle while driving, e.g. due to cruise control devices. Car manufacturers' development tends towards automated driving. Therefore, the driver has even more time to use his feet for other purposes, such as human computer interaction. Figure 1 shows the driving automation level guideline.
Partial Automation already includes the system execution of vehicle acceleration and deceleration [23]. Therefore, driver feet become idle. Thus, a technique that uses the driver's feet as a gesture recognition input device is investigated and developed in this work.
Gesture recognition, as well as driver monitoring systems, are often based on camera systems. Such systems track driver conditions using optical processing. This means that images, which include the region of interest, have to be captured. Hence, the captured images may include face or hands.
Some research, as presented in Section 2, even places a camera in the driver legroom to monitor the driver's feet. This can cause privacy related issues, since upcoming vehicle network connections may introduce security vulnerabilities.

Levels of automated driving [23].
This work presents a system that captures driver data at an abstraction level too high to reveal privacy related information. This is ensured by capturing a small integer array of electric field related data points instead of an image.
Camera based recognition systems require a line of sight [18]. Therefore, their sensor systems have to be integrated into the vehicle interior design. Each camera becomes visible to the user.
Instead, a sensor system that can sense through non-conductive material will be used. Therefore, the system can be integrated into existing vehicle structures. The integration can be facilitated without visible design impact. Furthermore, the sensing system can be included directly below visible covers. Those covers can consist of textile and may be covered by driver extremities without data loss.
The following list shows the stated contributions:
A new way of driver vehicle interaction using feet is explored
Driver feet can be tracked using capacitive proximity sensors
The system distinguishes between four driver foot gestures
The system is integrated invisibly below existing vehicle structures
The sensor system and setup do not cause privacy issues
The recognition is facilitated using capacitive proximity sensors
The system function is shown in an evaluation including six participants
Foot gesture recognition is the focus of the related work presented in Section 2.1. That section covers foot gesture recognition in general and is not restricted to automotive applications. Section 2.2 shows research on capacitive proximity sensing in automotive applications.
Foot gesture recognition and tracking
Similar to other recognition applications [30,31], the geometric shape of the feet and their material characteristics provide several opportunities to track them. Furthermore, researchers rely on the fact that foot movement implies whole body movements due to kinetic connection.
Scott et al. investigate this fact by using a hip located accelerometer [24]. They use a mobile phone accelerometer, located in the back pocket of the trousers. The system distinguishes between ten foot gestures. They rely on the rotation around the heel and toe, which enables rotation and tapping gestures. Utilizing a Naïve Bayes classifier, they reach an accuracy between 82% and 92%. They point to the fact that the user would not always have his mobile phone in his pocket, which limits the time during which the recognition application is active. Furthermore, they state the issue that other activities not intended as gestures, like walking, could induce gesture-like acceleration patterns [24].
Another paper uses the gesture dependent pressure distribution between feet and floor. Instead of mounting pressure sensors under the floor, Fukahori et al. include pressure sensors into the users’ socks. This enables the usage of any common floor without further manipulation [12].
They point out that the system can be used in crowded public spaces like trams. Crowded spaces could prevent users from using their hands. In contrast to the previously presented paper, their gestures do not require visible foot movement. The user has to relocate his foot pressure from toes to heel or root side. Given this condition, they conducted an evaluation providing a complete multimedia application, controlled by feet [12].
The gesture set depends on the current application state. Depending on the state, the number of recognized gestures ranges from two, like answer phone call or ignore phone call, to five during browser navigation, map navigation or media player control. They test their application in a study with five subjects. The study results in an average accuracy of 91.3%, based on stratified cross-validation. Further, the study results in an accuracy of 56.2% for leave-one-participant-out validation. Based on these results, they state that the system is strongly user-dependent [12].
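The gap between the two validation schemes is easier to see when the split itself is written out. The following is a minimal sketch of leave-one-participant-out splitting in plain Python. The function name and the participant labels are hypothetical and not taken from [12]; it only illustrates that every fold tests on a subject whose data was never seen during training.

```python
from collections import defaultdict


def leave_one_participant_out(participant_ids):
    """Yield (held_out, train_idx, test_idx) with one participant held out per fold."""
    by_participant = defaultdict(list)
    for index, pid in enumerate(participant_ids):
        by_participant[pid].append(index)
    for held_out in sorted(by_participant):
        test_idx = by_participant[held_out]
        # Training indices come from every participant except the held-out one.
        train_idx = [
            i
            for pid, indices in by_participant.items()
            if pid != held_out
            for i in indices
        ]
        yield held_out, train_idx, test_idx


# Hypothetical labels: five subjects with two samples each.
participants = ["s1", "s1", "s2", "s2", "s3", "s3", "s4", "s4", "s5", "s5"]
folds = list(leave_one_participant_out(participants))
```

In stratified cross-validation, samples of the same subject can appear in both training and test folds, which is one plausible reason for the higher accuracy reported with that scheme.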
Many gesture recognition systems, e.g. hand tracking systems, use cameras to monitor the concerned object. Tran et al. present a system that uses a camera to track the driver feet while driving [26].
Their aim is to predict foot movement towards pedals, which indicates the driver's intention to press a pedal. They use optical flow processing for feature generation and a hidden Markov model for prediction. Tran et al. present a study with twelve participants, all of whom have to fulfill the same tasks. The aim is to distinguish between seven pedal application conditions: neutral (no intention), brake engaged, release brake, moving towards brake, accelerator engaged, moving towards accelerator and accelerator released [26].
The evaluation of the prediction performance results in a mean correct classification rate of 93.77%. They conduct a leave one participant out validation. This validation results in decreased performance. They are able to predict pedal application 133 ms prior to actual pedal-foot contact with a score of 74% correctly classified samples [26].
Automotive capacitive proximity sensing applications
Capacitive proximity sensing (CAPS) has found its way into automotive interiors at several original equipment manufacturers (OEMs). Volkswagen attaches capacitive touch sensors to infotainment displays. This approach enables them to provide gestures like swipe and pinch. Moreover, the CAPS are used to predict the proximity of hands towards the infotainment for automatic display adjustment [29].
In two research papers, HUDConCap and AuthentiCap, eight CAPS electrodes are included into an ordinary steering wheel. This makes it possible to capture driver hand movements. Those papers include an algorithm with support vector regression at its core. The regression models transduce CAPS output into a planar hand movement. Based on this tracking, HUDConCap provides a new way of Head-Up Display (HUD) interaction, and AuthentiCap provides a new touch-less, adaptable driver authentication mechanism [9,10].
Braun et al. include 16 CAPS electrodes into an ordinary automotive car seat. Their CAPS arrays are mounted in the backrest, the seating and the headrest. All of them are invisibly covered by the seat's cushion or a slip cover. Due to this setup, they are able to distinguish if the seat is occupied or not. Furthermore, two classes of driver posture and a proper headrest position can be detected. Further, the seat adjustment is supported by automatic measurement of the driver leg length and position. In a user study with a seat mockup, Braun et al. show a significant relation between seat adjustment, driver size and CAPS output [2].
Ziraknejad et al. focus on a particular driver seat part: The head restraint. They introduce a CAPS setup with a large-scale analysis of CAPS output depending on temperature and electrode shape. They do not only aim to detect the driver head, they also include head restraint actuation elements which move the head restraint into optimal driving position [32].
The system is equipped with three CAPS electrodes, equally arranged around the head restraint's center point. This results in a three-pointed star shape. They set up a three-layer feed-forward neural network to determine the current head position relative to the head restraint. Furthermore, they evaluate the system performance by splitting 4000 data instances into 75% training samples and 25% validation samples [32].
All samples consist of the three CAPS electrode ratios. The current head position is selected as label. They allow movement of 14 cm, 7 cm and 7 cm (vehicle's pitch axis, roll axis and yaw axis). Each movement shows a mean Euclidean distance error of 0.33 cm. This allows proper head restraint adjustment [32].
Capacitive proximity sensing systems also exist outside automotive domains. Braun et al. used CAPS to equip ordinary office chairs. This furniture is capable of exercise tracking, respiration rate measurement and position tracking. Moreover, it is capable of activity tracking. It can be extended with acoustic signals [3–5].
Contribution
The driver foot position should be detected using a given vehicle structure. A constraint is to use invisibly integrated sensors. Therefore, Section 3.1 shows the selection of an appropriate vehicle structure. Additionally, an adequate sensor topology is designed in Section 3.1. Section 3.2 shows the gesture selection process. Section 3.3 shows the generation of a model that predicts driver foot gestures. This is done by using preprocessed CAPS data. The data preprocessing is presented in Section 3.3.
Vehicle structure and sensor selection
Driver-initiated foot gestures shall be captured. Therefore, a vehicle structure close to the driver feet is selected. The selected structure is the legroom floor and the legroom side walls. A common vehicle legroom is covered with non-conductive textile or other synthetic material. This enables the integration of capacitive proximity sensing (CAPS) electrodes under those covers.
The desired invisible sensor system integration, which should enhance design and privacy features, motivates the selection of CAPS. The driver may perform the gestures in all degrees of freedom. This includes translation and rotation in three directions.
This wide spatial condition is covered by a three-dimensional CAPS electrode array topology. The topology is shown in Figure 2. The driver performs gestures partly on top of the sensing electrodes. Due to the sensor assembly below textile covers, the sensor array is invisibly integrated into the vehicle interior design.

Driver legroom with sensors and electrode topology.
Natural human gestures and human vehicle interaction gestures may differ concerning executed movement and recognized intention mapping. Different gesture types and the related difficulties are shown in Section 3.2.1. Afterwards, selected gestures based on the gesture characteristics are shown in the second subsection (3.2.2).
Gesture characteristics
Gestures require the user to remember predefined movements. Proper gesture execution leads to the desired system stimulation. The cognitive load, applied to the user remembering those gestures, depends on the gesture types.
Hummel et al. [17] provide an explanatory gesture type overview. On the one hand, there are symbolic gestures. Five examples are shown in Figure 3. This gesture type is related to the system's detection abilities. Touch screens usually provide symbolic gestures like pinch-to-zoom, which is shown on the right of Figure 3. As stated by Cassell [7], those kinds of gestures are not intuitive to the user. Therefore, the user has to learn them similar to a programming language command.
On the other hand, there are gesticulations while talking. Considering those gestures in a gesture recognition system would require spoken language processing, because they may depend on situational context [17].
Lastly, there are act gestures [21]. Those gestures relate to the intention directly. The user could describe an object's shape with his hands or point to objects which should be selected. Therefore, the system has to know object locations and requires context information. The computer must distinguish between contexts to find the proper user intention [17].

Examples of symbolic gestures [17].

Häuslschmid et al. gesture set [15].
A more technical approach is taken in Häuslschmid et al.'s investigation of freehand and micro gestures. Both gesture types and their application concepts are shown in Figure 4. The gestures are used to control the car multimedia system. Furthermore, the participants execute gestures and conduct driving maneuvers simultaneously [15].
Freehand means that the user does not have contact to any surface while conducting gestures. Micro gestures provide a hand rest for the gesture conducting hand. Their results show that micro gestures have advantages concerning lane changing tasks while driving. Micro gestures show disadvantages concerning gesture completion. This points to a higher micro gesture security and a higher freehand gesture success [15].
The gesture characteristics described above relate to hand gestures. Thus, they have to be transferred to foot gestures. Foot gestures with one fixed foot pivoting point are used. This relates to Häuslschmid et al.'s investigation. Furthermore, freely moving driver feet may stress the driver's muscles. This leads to concerns about the driver's ability to conduct those gestures frequently.
Secondly, Hummel et al.’s gesture review seems to propose the use of act gestures or gesticulation while talking. These gestures relate to natural human gestures. Therefore, those gestures seem to show better user acceptance. Nevertheless, those gestures are context dependent. Due to the limitation of foot gestures, and the constraint to use pivoted gestures, those gestures are dedicated to specific applications. Those applications have to include the foot gesture to provide natural interaction while driving.
A further point is the number of gestures. While a large number of gestures provides fast and specific system control to the user, Kern et al. [19] show that increasing the number of control interfaces increases the cognitive load on the driver. Thus, a small gesture set is selected to avoid this issue.
In compliance with the constraints, defined in Section 3.2.1, already existing gestures are selected. The survey on foot-based gestures by Velloso et al. [27] presents eleven gestures. Foot rotation, tapping, two-foot movement, swiping and more are included. Due to the constraint of one fixed pivoting point, the foot gesture number is reduced to seven gestures.
This is a first investigation on driver foot gestures using CAPS. Thus single foot gestures are selected. Therefore, five gestures remain. One of them is the so-called shake gesture [27]. This gesture pivots around one single point, the foot center. The user has to relieve the leg. Thus, this gesture is excluded. Therefore, the resulting gesture set consists of four gestures, Toe Tap, Heel Tap, Toe Rotation and Heel Rotation. These are shown in Figure 5.
All gestures require single axis coordination capability of the driver. This means that every gesture consists of a single movement. For example, Toe Tap requires a rotation around the heel where the toes move up and down. Toe Rotation requires a rotation around the toes.
This system addresses driving situations in which the driver's feet are not engaged, for example driving with cruise control or automated driving situations. Therefore, the gestures refer to both feet, right and left.

Selected foot gestures [27].
Linear normalization is applied to the CAPS raw sensor output. The considered sensor value minimum is automatically adapted at system runtime. The maximum value is obtained by adding an empirically determined value span to the minimum value. The span is considered constant for all measurements. All values between the minimum and maximum value are linearly interpolated between zero and one. This interpolated data is called MinMax in the following.
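The normalization can be sketched in a few lines of Python. This is a minimal illustration and not the original implementation: the class name, the running-minimum update rule, the clamping of values above the maximum and the span value of 200 are all assumptions made for the example.

```python
class MinMaxNormalizer:
    """Linear normalization with a runtime-adapted minimum and a fixed span."""

    def __init__(self, span: float):
        if span <= 0:
            raise ValueError("span must be positive")
        self.span = span      # empirically chosen constant (assumed value)
        self.minimum = None   # lowest raw value observed so far

    def update(self, raw: float) -> float:
        # Assumption: the minimum adapts by tracking the lowest value seen.
        if self.minimum is None or raw < self.minimum:
            self.minimum = raw
        # Interpolate linearly between minimum and minimum + span.
        scaled = (raw - self.minimum) / self.span
        # Assumption: values beyond the maximum are clamped to 1.0.
        return min(max(scaled, 0.0), 1.0)


# Hypothetical raw readings of a single channel.
normalizer = MinMaxNormalizer(span=200.0)
values = [normalizer.update(raw) for raw in [1000.0, 1100.0, 1300.0, 950.0]]
```

One such normalizer per electrode channel would keep the channels independent of each other's baseline.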
MinMax is one part of the classifier feature vector. An additional prediction of the planar feet position should increase the classification accuracy. Thus, the planar feet position is added to the feature vector.

Position prediction model.

Gesture prediction model.
The planar feet position is derived from the MinMax data. This is facilitated using six random forest regression models, one for each position measure. Afterwards, the outputs of those six random forest regression models and the MinMax data are used as the input vector for six further random forest regression models. This shall increase the planar feet position prediction accuracy, since the input then includes the position of both feet. The feet position is called CAPSPos in the following.
CAPSPos includes two-dimensional translation perpendicular to the vehicle yaw axis and rotation around the vehicle yaw axis. This foot position tracking model is shown in Figure 6.
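The two-stage position model can be sketched as follows. The original system was trained with Weka; this sketch swaps in scikit-learn's `RandomForestRegressor` purely for illustration. The function names, the number of trees and the synthetic random data are assumptions, and the six target names follow the abbreviations used later in the evaluation (XL, YL, AL, XR, YR, AR).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

TARGETS = ["XL", "YL", "AL", "XR", "YR", "AR"]  # six position measures


def fit_stage(features, targets, seed=0):
    """Fit one random forest regressor per position measure."""
    models = []
    for column, _name in enumerate(TARGETS):
        model = RandomForestRegressor(n_estimators=20, random_state=seed)
        model.fit(features, targets[:, column])
        models.append(model)
    return models


def predict_stage(models, features):
    """Stack the per-measure predictions into an (n_samples, 6) array."""
    return np.column_stack([model.predict(features) for model in models])


def fit_caps_pos(min_max, positions):
    """Stage 1 maps MinMax to positions; stage 2 sees MinMax plus stage-1 output."""
    stage1 = fit_stage(min_max, positions)
    stacked = np.hstack([min_max, predict_stage(stage1, min_max)])
    stage2 = fit_stage(stacked, positions)
    return stage1, stage2


def predict_caps_pos(stage1, stage2, min_max):
    """Run both stages and return the refined position estimate."""
    first = predict_stage(stage1, min_max)
    return predict_stage(stage2, np.hstack([min_max, first]))


# Synthetic stand-in data: 50 samples, 8 normalized channels, 6 position labels.
rng = np.random.default_rng(0)
min_max = rng.random((50, 8))
positions = rng.random((50, 6))
stage1, stage2 = fit_caps_pos(min_max, positions)
caps_pos = predict_caps_pos(stage1, stage2, min_max)
```

Because the second stage sees all six first-stage outputs, each refined measure can draw on the estimated position of both feet. In this sketch the second stage is trained on in-sample first-stage predictions for brevity; out-of-fold predictions would be a more careful choice.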
The data set now comprises MinMax and CAPSPos. MinMax gives information about the distribution of objects in the legroom space. CAPSPos provides an approximation of the feet position. All data provide information about a single time step, but gestures consist of sample series: a gesture is a temporally linked combination of foot positions. Therefore, a feature vector consisting of multiple samples over time is selected as gesture representation.
Every user has their own interpretation of motion gestures. Therefore, the gesture duration of a particular user is unknown. Thus, the number of samples in the time series remains an empirically defined parameter. The value will be determined during evaluation in Section 5. The minimum gesture processing times of all users will be collected. Afterwards, the feature vector length is set to the median duration (in samples) multiplied by the number of MinMax and CAPSPos values per sample.
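A minimal sketch of how such a time-series feature vector could be assembled is given below. The function name and the per-time-step concatenation order are assumptions. The eight channels, six position values and the window of 28 samples are the figures reported for the mockup and the evaluation elsewhere in this paper.

```python
def build_feature_vector(min_max_series, caps_pos_series, window):
    """Flatten the last `window` time steps of MinMax and CAPSPos into one vector."""
    if len(min_max_series) != len(caps_pos_series):
        raise ValueError("series must have the same number of time steps")
    if len(min_max_series) < window:
        raise ValueError("not enough samples for one window")
    vector = []
    recent = zip(min_max_series[-window:], caps_pos_series[-window:])
    for min_max, caps_pos in recent:
        vector.extend(min_max)   # normalized channel values for this time step
        vector.extend(caps_pos)  # predicted foot position for this time step
    return vector


# Hypothetical series: 30 time steps, 8 channels and 6 position values each.
min_max_series = [[0.0] * 8 for _ in range(30)]
caps_pos_series = [[0.0] * 6 for _ in range(30)]
features = build_feature_vector(min_max_series, caps_pos_series, window=28)
```

With these sizes the vector holds 28 × (8 + 6) = 392 entries.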
A random forest classifier is used for gesture prediction [6]. The model tends to generalize well. Moreover, model training and validation can be completed in adequate processing time. The data processing model is shown in Figure 7. The labels “tree n” represent random forest models. Further, “CAPSPos t0…tn” represents CAPSPos at different time steps. Similarly, “MinMax t0…tn” represents MinMax data at different time steps.
The concept presented in Section 3 has to be proved. Thus, a setup in which gesture recognition performance can be evaluated has to be built. This assembly has to ensure a driver feet space representation. Furthermore, the collection of sensor data, to train the random forest classifier and validate the model, has to be facilitated. Thus, a vehicle legroom mockup was built. It is presented in Section 4.1.
The selection of a proper CAPS system that fits into the selected vehicle structure is shown in Section 4.2. Section 4.3 shows the actual CAPS system implementation into the mockup. Furthermore, Section 4.4 shows the software concept implementation. Section 4.5 shows the feet position labeling process application.
Mockup setting
The driver legroom space is represented by a wooden box. Its inner dimensions are 60 cm depth, 40 cm width and 60 cm height. Figure 8 shows a comparison of a facsimile vehicle copy on the left and the mockup on the right. The mockup includes a conventional car mat. Furthermore, an original car pedal setup is included. The pedal setup consists of clutch, brake and accelerator pedal.
In real cars, the driver feet movement is limited by the side door carrier on the left side and the center console on the right side. The mockup restricts feet movement through wooden walls. Furthermore, the driver movement towards the pedals is limited by his own anthropometric constraints. The mockup does not limit upward driver feet movement. All selected gestures rely on at least one foot fixture on the car mat. Therefore, there is no significant spatial movement towards the legroom upper constraints.

Driver legroom mockup.

OpenCapSense toolkit [13].
A proper capacitive proximity sensing setup is required. Almost any microcontroller is capable of capacitive sensing (it requires a high-value resistor). Nevertheless, a more sophisticated off-the-shelf solution is selected: the OpenCapSense toolkit [13], shown in Figure 9. It provides sample rates up to 1 kHz; the default toolkit sample rate of 25 Hz is used. The toolkit consists of one main controller board which includes a processing unit. This unit measures the sensor capacitance. Each sensor contains a sensing electrode and optional shielding.
All sensing electrodes have a congruently sized shield on the back. Furthermore, the board runs in loading mode. Loading mode means that a single electrode is used for electric field sensing. Grounded objects, for example humans, that move towards the electrode change the electric field. Thus, they influence the measurement. Figure 10 shows the loading mode configuration on the left and shunt mode on the right. Both modes can be compared to a plate capacitor. The two plates in loading mode are formed by the grounded object and the sensing electrode. In shunt mode, the sensing system provides both plates. The grounded object acts like a shield.
See [13] for further information about OpenCapSense toolkit and capacitive proximity sensing modes.
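The plate capacitor comparison can be made concrete with the ideal parallel-plate formula C = ε0 · εr · A / d. The sketch below is a strong idealization: the electrode area and distances are hypothetical, and a real foot over a shielded electrode is not an ideal plate, so only the qualitative trend carries over.

```python
EPSILON_0 = 8.8541878128e-12  # vacuum permittivity in F/m


def plate_capacitance(area_m2, distance_m, relative_permittivity=1.0):
    """Ideal parallel-plate capacitance C = eps0 * eps_r * A / d, in farads."""
    if distance_m <= 0:
        raise ValueError("distance must be positive")
    return EPSILON_0 * relative_permittivity * area_m2 / distance_m


# Hypothetical 10 cm x 10 cm electrode; the foot approaches from 10 cm to 2 cm.
area = 0.10 * 0.10
far = plate_capacitance(area, 0.10)
near = plate_capacitance(area, 0.02)
```

In this idealization the capacitance grows inversely with distance, so moving the grounded object five times closer raises the measured capacitance by a factor of five.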

Capacitive proximity sensing modes. Adapted from [13] Fig. 2.
The sensing electrode and the sensor are connected by a coaxial two wire line. The cable core is connected to the sensing electrode and the woven copper shield is connected to sensor shielding.
The mockup is equipped with sensing electrodes according to the concept proposed in Section 3. Although the concept suggests twelve sensing electrodes, their number is reduced to eight, which matches the maximum number of OpenCapSense toolkit channels. Thus, the sensing electrodes under the car mat are reduced from eight single electrodes to four. This reduction is expected to lower the resolution of planar positions on the ground.
The sensing electrodes and the reduction are sketched in Figure 11. The reduction of sensing electrodes should also decrease CPU usage, because the measurement input array shrinks and a smaller input vector has to be processed. If the gesture recognition accuracy were below 90% during evaluation, the number of electrodes would be increased by adding a second OpenCapSense toolkit.

Topology comparison between concept and mockup implementation.
The foot gesture recognition model includes a supervised learning based model. Therefore, labeled data is required. The labels depend on the user’s gesture intention. Therefore, the user must be able to communicate gesture execution.
Thus, a gesture finished button is added to the test manager application. It is controlled by the test supervisor. During execution, the application shows an image of the required gesture. The subject should then start the expected gesture. After finishing the gesture, the subject reports this to the test supervisor.
The test supervisor uses the gesture finished button to let the application know that the user has finished the gesture. The time between the pop-up of the gesture image and the click on the gesture finished button need not be congruent with the actual gesture execution time. Furthermore, the subject may report unintended gesture execution. This cannot be captured by the application during evaluation. Therefore, the recorded time span between gesture start and gesture end is a hint for gesture position in data.
Thus, the measured data has to be checked manually. In doing so, the start and stop of the foot movement are labeled. This approach makes it possible to filter idle time at start and stop from the gesture samples.
While the users perform the evaluation, legroom images are captured. Each legroom camera shot is accompanied by a capacitive proximity sensing data sample. The measurement series is stored in a comma separated text file. It contains the complete capacitive proximity sensing data samples and the path to the stored images. Furthermore, each row contains a time stamp for the sample and labels for the used foot and the intended gesture. The camera view and a sample legroom camera shot are shown in Figure 12.

Mockup view of used camera to capture driver feet.
The whole application is written in C# using the .NET framework. Moreover, Weka [14] is used for model training. Weka provides a Java programming library. IKVM [11] provides libraries to use Java libraries in .NET applications. Thus, IKVM is used to combine the trained Weka models and the C# application.
Besides the gesture labels, the feet position is used in the gesture recognition model feature vector. Therefore, the feet position has to be indicated with sufficient accuracy. To do so, the mockup camera captures a bird's-eye view of the driver's feet. In a first attempt, the data was labeled without markers using algorithms based on optical flow, similar to Tran et al. [26].
A spot check of the resulting labels showed that they deviate too much from the true feet position. Therefore, a color tracking-based approach is investigated. Every user had to wear colored covers on their shoes. Afterwards, the input images are filtered separately based on the applied colors (green on the left, red on the right). This results in blobs.
The blobs are filtered for appropriate size. Afterwards, a minimum area rectangle is fitted to the blob. Furthermore, an ellipse is fitted to the blob. The intersection between the ellipse major axis and the fitted rectangle marks the target planar position. The ellipse major axis angle is used as the angle of the respective foot. Figure 13 shows a sample camera view with fitted ellipse. Label (1) shows the resulting position point. Label (2) shows the x and y position predicted by the position regression model.
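The angle extraction can be illustrated with a simplified stand-in. Instead of explicit ellipse and rectangle fitting, the sketch below uses second-order image moments, which yield the major-axis orientation of the equivalent ellipse of a blob. It is not the labeling pipeline of this work: the function name and the pixel list are hypothetical, and it returns the blob centroid rather than the intersection of the major axis with the fitted rectangle.

```python
import math


def blob_centroid_and_angle(pixels):
    """Return the centroid (x, y) and major-axis angle in degrees of a pixel blob."""
    n = len(pixels)
    if n == 0:
        raise ValueError("empty blob")
    cx = sum(x for x, _ in pixels) / n
    cy = sum(y for _, y in pixels) / n
    # Second-order central moments describe the spread of the blob.
    mu20 = sum((x - cx) ** 2 for x, _ in pixels) / n
    mu02 = sum((y - cy) ** 2 for _, y in pixels) / n
    mu11 = sum((x - cx) * (y - cy) for x, y in pixels) / n
    # Orientation of the major axis of the equivalent ellipse.
    angle = 0.5 * math.atan2(2.0 * mu11, mu20 - mu02)
    return (cx, cy), math.degrees(angle)


# Hypothetical blob: ten pixels on a diagonal line.
diagonal = [(i, i) for i in range(10)]
center, angle_deg = blob_centroid_and_angle(diagonal)
```

In image coordinates the y axis usually points downwards, so the sign of the angle has to be interpreted accordingly.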

Position labeling sample.
The evaluation consists of an evaluation setup, presented in Section 5.1. The setup is the basis for further data capture presented in Section 5.2. Afterwards, the collected data is used to train several supervised learning-based models to provide results about the system performance in Section 5.3.
Evaluation setup
The prototype, presented in Section 4, is used in the evaluation. It includes accelerator, brake and clutch pedals. Furthermore, it provides the user's feet with a space equivalent to a vehicle legroom. The legroom prototype is covered with a wooden plate that prevents a line of sight between the user and his feet.
On top of the wooden plate, a monitor provides information about the further evaluation procedure. It is capable of showing the currently recognized feet position. Furthermore, it can show gesture figures that prompt the user to start the corresponding gesture.
The software application, developed for this work, is capable of presenting gesture images on the monitor. Furthermore, it is able to capture all sensor data (camera and capacitive proximity sensors). It includes the ability to set flags for gesture start and stop on the subject's demand. Furthermore, it can advise the user to move his or her feet around. This is required to capture sufficient data without gesture request.
Evaluation procedure
Six subjects participated in the evaluation. Each subject was asked for his or her age and shoe size. Before the procedure, each participant received the same introduction, during which the instructor let the users inspect the prototype without interacting with it.
No user was allowed to access the legroom space before the measurement started. Each gesture was presented to the subjects, and the instructor demonstrated how it is intended to look.
Afterwards, the measurement was started without the user's feet inside the cabin. The subject moved the feet into the cabin and started moving them around. The monitor then displayed the previously presented gesture images, and the subject performed each gesture according to his or her own idea of how it should look. Moreover, the subjects were allowed to inform the instructor if they mixed up gestures or unintentionally performed another gesture.
Each of the four gestures was presented 20 times to each subject (10 times for the left foot, 10 times for the right foot). This results in a total of 480 gesture samples. Additionally, 480 none-gesture samples were collected: each subject provided 80 none-gesture samples distributed over the captured data. None-gesture samples exclude gesture sample data.
Due to this approach, the class distribution consists of 50% gesture samples and 50% none-gesture samples. The gesture samples consist of four equally distributed gesture types, each contributing 12.5% to the whole data set. The data set is therefore balanced between gesture and none-gesture samples, but imbalanced when each gesture type is treated as its own class next to the none-gesture class. This imbalanced setup was selected to decrease the false positive rate with gestures as the positive class. Furthermore, plenty of none-gesture samples are included in the data, because none-gestures are arbitrary movements and could therefore include gesture-like patterns.
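The planned sample counts and class shares follow directly from the procedure and can be checked with a few lines of Python. The variable names are illustrative; the numbers are those stated above and describe the planned protocol, before any unintentionally mixed gestures are taken into account.

```python
SUBJECTS = 6
GESTURES = ["Toe Tap", "Heel Tap", "Toe Rotation", "Heel Rotation"]
REPETITIONS_PER_GESTURE = 20   # 10 with the left foot, 10 with the right foot
NONE_SAMPLES_PER_SUBJECT = 80

# Planned number of samples per class over all subjects.
counts = {gesture: SUBJECTS * REPETITIONS_PER_GESTURE for gesture in GESTURES}
counts["None"] = SUBJECTS * NONE_SAMPLES_PER_SUBJECT

total = sum(counts.values())
shares = {label: count / total for label, count in counts.items()}
gesture_total = total - counts["None"]
```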
Evaluation results
The evaluation is split into two parts. Section 5.3.1 shows the evaluation of the automated position labeling process. This includes the planar feet position prediction.
Section 5.3.2 shows information about the subject, the distribution of their conditions and the gesture recognition evaluation results. Afterwards, the gesture recognition performance of the described model is analyzed. This is done by using the collected data of this study.
Validation results

Position prediction cumulative residual distribution.
As described in Section 3.3, color markers are applied to the subjects' feet in order to track the true feet position. Table 1 shows the random forest model's performance, split into the different feet position quantities. As already stated, the labels are derived from the mockup camera images.
The position labels are abbreviated. XL refers to the left foot x-axis position and YL to the left foot y-axis position. XR and YR refer to the same axes for the right foot. Furthermore, AL and AR refer to the left and right foot angular displacement. The translational positions show a correlation coefficient (R²) greater than 0.99. The angular displacements show an R² greater than 0.98.
Furthermore, the mean absolute error ranges from 5.64 to 21.35 pixels for translational movement (at an average of 23.26 pixels per centimeter, this corresponds to 0.24 cm to 0.92 cm) and from 6.54° to 9.79° for angular displacement.
Figure 14 shows the cumulative residual distribution function for each position quantity. The underlying histogram uses 100 bins. All plots include labels for the mean value (μ).
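The conversion from pixel error to centimeters is a simple division by the image scale. A short sketch, using only the scale and error values reported above (the function name is illustrative):

```python
PIXELS_PER_CM = 23.26  # average image scale reported in the text


def pixels_to_cm(pixels, pixels_per_cm=PIXELS_PER_CM):
    """Convert a distance measured in camera pixels to centimeters."""
    return pixels / pixels_per_cm


if __name__ == "__main__":
    for error_px in (5.64, 21.35):
        print(f"{error_px} px = {pixels_to_cm(error_px):.2f} cm")
```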
Subject summary
Table 2 shows a summary of the participants' characteristics. The subject group consists of three male and three female persons. Their European shoe sizes range from 38 to 46. The conversion to US-M shoe sizes is taken from a table and not from direct participant measurement. Ages range from 29 to 59. Each participant used his or her own shoes.
Evaluation subjects
User C is the only user who reported unintentionally mixed gestures. He performed “Toe Tap” instead of “Toe Rotation” twice, “Heel Rotation” instead of “Heel Tap” once, “Toe Rotation” instead of “Heel Tap” once, “Toe Rotation” instead of “Heel Rotation” once, and “Heel Rotation” instead of “Toe Tap” once. Additionally, he added one “Toe Tap” gesture before a “Heel Tap” gesture. All those unintentionally mixed gestures are added to the gesture sample set. The resulting sample data set therefore consists of the gesture samples presented in Table 3 (HR = Heel Rotation, HT = Heel Tap, TT = Toe Tap, TR = Toe Rotation, N = None).
Number of gesture samples
As stated in Section 3, the feature vector length has to be defined. Therefore, the subjects' processing time for each gesture is analyzed. The analysis results in a median value of 28 and a mean value of 27.5 data points. Thus, a window of 28 data points is selected. Each data point contributes eight CAPS channel values plus six position values from the capacitive proximity sensing models. This results in 28 × (8 + 6) = 392 entries for each feature vector.
All subject data is combined into one data set as the basis for random forest training and evaluation. The model is validated using tenfold cross-validation. Table 4 shows a confusion matrix of the results.
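The feature vector layout can be sketched as follows. The sizes come from the text; the ordering of values inside the vector (frame by frame, CAPS channels before positions) is an assumption made for illustration, as the exact layout is not specified here.

```python
WINDOW_LENGTH = 28    # data points per gesture window (median from the text)
CAPS_CHANNELS = 8     # capacitive proximity sensor channels
POSITION_VALUES = 6   # XL, YL, AL, XR, YR, AR


def build_feature_vector(window):
    """Flatten a window of frames into one feature vector.

    `window` is a sequence of (caps_values, position_values) pairs.
    The frame-by-frame ordering is an assumption of this sketch.
    """
    if len(window) != WINDOW_LENGTH:
        raise ValueError(f"expected {WINDOW_LENGTH} frames, got {len(window)}")
    features = []
    for caps, positions in window:
        if len(caps) != CAPS_CHANNELS or len(positions) != POSITION_VALUES:
            raise ValueError("frame has an unexpected number of values")
        features.extend(caps)
        features.extend(positions)
    return features


if __name__ == "__main__":
    dummy_frame = ([0.0] * CAPS_CHANNELS, [0.0] * POSITION_VALUES)
    print(len(build_feature_vector([dummy_frame] * WINDOW_LENGTH)))
```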
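For readers unfamiliar with the procedure, tenfold cross-validation partitions the data into ten folds and uses each fold once as the test set while training on the remaining nine. The sketch below shows only the index splitting; whether the study shuffled or stratified the samples is not stated, so the shuffling and the seed are assumptions.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Samples are shuffled once with a fixed seed (an assumption of this sketch),
    then dealt into k folds of nearly equal size.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


if __name__ == "__main__":
    for fold_number, (train, test) in enumerate(k_fold_indices(960), start=1):
        print(f"fold {fold_number}: {len(train)} training, {len(test)} test samples")
```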
Confusion matrix
In Table 4 the gestures are represented by their abbreviations (HR = Heel Rotation, HT = Heel Tap, TT = Toe Tap, TR = Toe Rotation, N = None). The number of correctly classified instances is 894, which corresponds to an accuracy of 93%. Furthermore, the weighted average true positive rate is 0.93 while the weighted average false positive rate is 0.035.
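The weighted average rates can be derived from any confusion matrix as sketched below. The small matrix in the usage example is hypothetical and is not the data of Table 4. Note that the support-weighted true positive rate equals the overall accuracy, which is consistent with both values being reported as 0.93.

```python
def weighted_rates(matrix):
    """Return (weighted TPR, weighted FPR) for a square confusion matrix.

    matrix[i][j] is the number of samples of true class i predicted as class j.
    Each class is weighted by its share of the samples. The sketch assumes
    every class has at least one sample and is not the only class.
    """
    n_classes = len(matrix)
    total = sum(sum(row) for row in matrix)
    weighted_tpr = 0.0
    weighted_fpr = 0.0
    for c in range(n_classes):
        support = sum(matrix[c])
        true_pos = matrix[c][c]
        false_pos = sum(matrix[r][c] for r in range(n_classes)) - true_pos
        negatives = total - support
        weight = support / total
        weighted_tpr += weight * (true_pos / support)
        weighted_fpr += weight * (false_pos / negatives)
    return weighted_tpr, weighted_fpr


if __name__ == "__main__":
    hypothetical = [[8, 2], [1, 9]]  # illustrative values only
    tpr, fpr = weighted_rates(hypothetical)
    print(f"weighted TPR = {tpr:.2f}, weighted FPR = {fpr:.2f}")
```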
Figure 15 shows the receiver operating characteristic (ROC). ROC is a tool to evaluate a classifier's performance. It shows the true positive rate (tpr) on the y-axis and the false positive rate (fpr) on the x-axis. A perfect classifier would have a tpr of one and an fpr of zero. Additionally, the area under the ROC curve (AUC) can be computed. AUC is the ranking accuracy; see [8] for further information about ROC and AUC.
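The interpretation of AUC as ranking accuracy can be made concrete: AUC equals the fraction of positive/negative sample pairs in which the positive sample receives the higher score. The following sketch computes it for a binary problem; applying it per gesture class in a one-vs-rest manner is an assumption and not necessarily the procedure used for Figure 15.

```python
def auc_by_ranking(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    `labels` contains 1 for positive and 0 for negative samples.
    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    print(auc_by_ranking([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))
```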

Receiver operating characteristic (ROC).
Since 93% of the samples are classified correctly, no further CAPS electrodes are added to the system. Instead, the current hardware setup is accepted as is. Nevertheless, a more sophisticated classification model could improve the results.

Heel rotation sample angular subject characteristics.
In particular, models like long short-term memory (LSTM) neural networks [16] could be considered. These models are designed to handle sequential patterns due to their ability to maintain an internal state. Another approach could include a two-level classification process. A first-level classifier could distinguish between none-gesture and any gesture. Afterwards, a second classifier would distinguish between the specific gesture types. This reduces the number of classes each classifier has to handle and could therefore result in better performance.
Moreover, the features depend partly on linearly normalized capacitive proximity data. This neglects the fact that, in a parallel-plate capacitor model, the capacitance is inversely proportional to the object distance.
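The two-level idea can be sketched as a simple cascade. The two classifiers are passed in as callables; the toy stand-ins below are hypothetical placeholders and not trained models from this work.

```python
def two_level_predict(features, is_gesture, classify_gesture):
    """Level one decides gesture vs. none; level two picks the gesture type."""
    if not is_gesture(features):
        return "N"
    return classify_gesture(features)


# Hypothetical stand-ins for trained classifiers, for illustration only.
def toy_is_gesture(features):
    return max(features) - min(features) > 0.5


def toy_classify_gesture(features):
    return "HR" if features[0] < features[-1] else "TR"


if __name__ == "__main__":
    for sample in ([0.1, 0.2, 0.1], [0.0, 1.0], [1.0, 0.0]):
        print(sample, "->", two_level_predict(sample, toy_is_gesture, toy_classify_gesture))
```

With this structure, the first classifier faces a two-class problem and the second a four-class problem, instead of one classifier handling five classes.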
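The effect of the reciprocal relation can be illustrated with an idealized parallel-plate model, C = k / d. The constant k (lumping permittivity and electrode area) and both function names are assumptions of this sketch; real proximity sensors deviate from the ideal model.

```python
def plate_capacitance(distance, k=1.0):
    """Idealized parallel-plate model: capacitance is k divided by distance."""
    if distance <= 0:
        raise ValueError("distance must be positive")
    return k / distance


def linearize(capacitance, k=1.0):
    """Invert the model so the resulting feature scales linearly with distance."""
    if capacitance <= 0:
        raise ValueError("capacitance must be positive")
    return k / capacitance


if __name__ == "__main__":
    # Equal distance steps produce unequal capacitance steps.
    for d in (1.0, 2.0, 3.0, 4.0):
        c = plate_capacitance(d)
        print(f"d = {d:.1f}  C = {c:.3f}  linearized = {linearize(c):.1f}")
```

A purely linear normalization of the raw values keeps this nonlinearity, so movements close to an electrode dominate the feature range.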
Figure 16 shows one sample heel rotation gesture for each subject. The diagrams on the left show each subject's left foot angular displacement. The images on the right of Figure 16 are taken from the evaluation. They show the minimum angular displacement on the left and the maximum angular displacement on the right. Since the feature vector length is constant, each diagram consists of 28 data points.
This figure shows the differences in the subjects' interpretation of the gesture. While Users E and F perform more than three foot displacement repetitions, User C barely completes one foot rotation repetition. Moreover, the peak-to-peak values differ from user to user. While User A shows a peak-to-peak value of 26°, User D shows a peak-to-peak value of 62°, which is an increase of 138.5%. Overall, the variance is 12.6° and the average value is 41.47°.
Therefore, the concrete gesture execution is very individual for each subject. A possible approach to address this issue and improve recognition could be continuous user monitoring in order to detect general gesture-dependent characteristics. For heel or toe rotation gestures, such a characteristic could be a noticeable foot rotation compared to the user's usual movement.
The evaluation concentrates on the gesture recognition system itself. It does not show any applications for the four proposed gestures. Therefore, the meaning of the proposed gestures still has to be analyzed. Swiping gestures on touch screens are a practical example: among the proposed gestures, heel rotation and toe rotation approximately correspond to the swiping motion.
These movements could be used to switch between two states, such as playlists in media control. They can therefore be related to the acknowledgment of dialogs or to music title selection. Depending on the human memory capacity to recall those gestures, the four gestures could also be used for control that is not related to natural movement. In this case, the users would have to learn the gesture function for the specific system.
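The peak-to-peak comparison above is plain arithmetic, sketched here for completeness. The two helper names are illustrative; the values 26° and 62° are the ones reported for Users A and D.

```python
def peak_to_peak(angles):
    """Difference between the largest and smallest angle in a gesture window."""
    return max(angles) - min(angles)


def relative_increase(baseline, value):
    """Increase of `value` over `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


if __name__ == "__main__":
    print(f"{relative_increase(26.0, 62.0):.1f}%")
```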
Apart from gesture recognition, the foot position tracking system seems to provide reasonable predictions. The translational position error, relative to movements along the vehicle roll and pitch axes, is less than one centimeter. In addition, the feet angle prediction error, with respect to the rotation around the vehicle's yaw axis, is less than 5.16°. The maximum rotation of User D is 61°. Relative to this value, the angular displacement error is 8.5%.
Thus, this tracking algorithm does not seem to capture angular displacement properly. Nevertheless, the translational movement tracking could be used in further human-machine devices. Those devices could require feet tracking data for pedal usage prediction. Furthermore, these devices could monitor the feet position during takeover requests in automated driving tasks.
The position prediction model validity is analyzed in Figure 14. All cumulative residual distributions are centered around zero. This indicates a valid model. Furthermore, the relatively small μ values indicate a model without bias. Nonetheless, the translational position tracking seems to perform better than the angular displacement tracking model.
Besides the evaluation of the technical recognition system, the question is whether those foot gestures lead to practical applications. Hands are used frequently in human communication. Therefore, they provide a natural basis for gesture-based control systems in vehicles. However, natural foot gestures relate to implicit movements based on feelings (impatience, insecurity, boredom [25]).
For example, humans usually do not use their feet to point towards objects. Further, they do not draw object shapes with their feet during communication. Feet movement during communication is usually implicit. This means that people signal emotions with their feet. A bored person could automatically execute the heel tap gesture [20].
A person who holds back strong feelings might lock his or her ankles [20]. This could be used to gather information about the driver state instead of using foot gestures for explicit human-vehicle interaction.
Rus et al. already investigated emotion capture using capacitive proximity sensing. They use furniture equipped with capacitive proximity sensing to detect whole body movement and posture. Afterwards, the emotions are derived from the movement and posture [22].
Intuitive, natural use of the proposed foot gestures cannot be presumed. The user would have to memorize the provided gestures. However, a set of four gestures is relatively small. Therefore, users could recall the gestures after a short training time. This is supported by the selected gestures, which are based on the same feet movement with different angular displacements (yaw and pitch) and different pivoting points (heel or toe).
The proposed system reaches a recognition rate of 93% correctly classified gestures. This is an indication that it can provide robust information. Nevertheless, further subjects are required to prove this. Moreover, the designed classification model, consisting of random forest classifiers, could be replaced by a model that focuses on the sequential structure of gestures. Furthermore, the model used neglects the anthropometric movement restrictions of driver feet.
The second statement of this work is that the system can be integrated invisibly into existing vehicle structures. To show this, a vehicle legroom mockup was successfully built. The floor sensing electrodes are covered by a car mat. The mat's influence on the sensor output does not significantly affect the ability to distinguish gestures. Furthermore, every vehicle has a driver legroom floor, a center console and doors. Thus, a vehicle structure that exists in every common car is used. Nevertheless, only a mockup is used for evaluation. Therefore, the next step is the system integration into common cars. This has to be done to prove the system function while driving and under less lab-like conditions.
The system is able to detect gestures, but the small array of eight sensing electrodes cannot provide information beyond changes in the electric field. Therefore, a driver image cannot be reconstructed. Nevertheless, a survey asking drivers about privacy concerns regarding capacitive proximity sensors would be required to give certainty.
A further contrast to camera-based systems is the computational effort. While camera or infrared based systems have proved their ability to capture foot gestures, the proposed system relies on capacitive proximity sensors, which require less computation.
Building on this research, further classification models have to be investigated. This can improve the recognition rate. Additionally, further subjects are required to collect data for model training and validation. Once the recognition rate has been increased and further subject data has been collected, the system can be moved into a vehicle for testing under real conditions. Furthermore, the driver’s ability to use foot gestures while driving with and without cruise control has to be investigated.
Further, the gesture effect on driving has to be evaluated. A usability study could investigate reasonable events caused by foot gestures. In particular, the study could compare hand and foot gestures to detect driver preferences. This should point to suitable applications, for example gestures for multimedia settings control or gestures for accepting and declining calls.
Another path is the investigation of further gesture types. Fukahori et al. [12] show an interesting direction on further gesture types concerning foot pressure distribution. This approach can be transferred to the vehicle system. In particular, the car mat on the legroom floor may act like a flexible buffer between driver feet and sensing electrodes. Therefore, the system of Fukahori et al. could be adapted to the feet pressure distribution on the legroom floor. In this case, the capacitive sensors would detect the feet pressure distribution. Furthermore, no shoe-integrated sensors would be required.
Hand gesture recognition has many applications in vehicles. In contrast, foot gesture recognition has so far focused on pedal error prediction. Due to this fact, appropriate applications for foot gestures have to be defined. Furthermore, due to the small selected gesture set, those applications could include context-dependent gesture control. In particular, gestures could control navigation system menus.
Such a task can considerably increase the driver's cognitive load. In addition, the driving task itself can stress the driver. Thus, foot gesture control might not be useful during driving tasks. However, foot gestures could be used to support hand gestures, similar to piano pedals. This would lead to a new way of ubiquitous interaction, which has to be studied in further research.
