Abstract
Socially interactive systems are embodied agents that engage in social interactions with humans. From a design perspective, these systems are built on a biologically inspired (bio-inspired) design that can mimic and simulate human-like communication cues and gestures. The design of a bio-inspired system usually consists of (i) studying biological characteristics, (ii) designing a robot with similar characteristics, and (iii) planning motions that mimic the biological counterpart. In this article, we present the design, development, control strategy and verification of our socially interactive bio-inspired robot, the Telepresence Mechatronic Robot (TEBoT). The key contribution of our work is the embodiment of real human neck movements by (i) designing a mechatronic platform based on the dynamics of a real human neck and (ii) capturing real head movements through our novel single-camera vision algorithm. Our socially interactive bio-inspired system is based on an intuitive integration-design strategy that combines a computer vision based geometric head pose estimation algorithm, a model based design (MBD) approach and real-time motion planning techniques. We have conducted extensive testing to demonstrate the effectiveness and robustness of the proposed system.
Keywords
Introduction
Robotic systems that employ human-like social cues and verbal and nonverbal communication modalities are called socially interactive interfaces/robots [1]. Socially interactive robots are important in domains where the primary function of a robot is to interact socially with humans. These systems are used in a variety of applications, e.g. multimedia [2], video teleconferencing [3], distance learning [4] and health care [5]. In this work, we propose a design and control strategy for a novel socially interactive bio-inspired system named Telepresence Mechatronic Robot (TEBoT). TEBoT is specifically suited to telepresence scenarios.
The technology used in socially interactive systems follows a standard layout consisting of two main components: 1) feedback motion control algorithms, and 2) planning of desired movement. The combination of both allows researchers to design systems that can move in a desired way. These motions are usually pre-specified or planned dynamically online. Among many applications, the task of mimicking a biological-system is a subject that still poses many challenges to researchers [6].
Over the years, a number of socially interactive robots have been designed, ranging from one to thirty degrees of freedom [7]. Despite the success of many systems (e.g., ASIMO by Honda [8]), there is no commonly agreed design procedure that can be employed fully, especially when it comes to dynamic modeling and analysis of a human body. This problem becomes even less trivial when computer vision based recognition, segmentation, modeling, and analysis are needed. Considering these problems, we have revisited the question by building a socially interactive bio-inspired robot which considers a human-in-the-loop as a designer, as an observer and as an interaction partner [1].
We have built a test-bed: a novel interactive head/neck robot named Telepresence Mechatronic Robot (TEBoT). The TEBoT design is inspired by the real human neck, and its mechanical design is built by studying real head/neck dynamics. Our intuitive design approach is based on model based design (MBD), which offers the benefits of model analysis, calibration, control and automatic code generation. The TEBoT is controlled by bringing a real human head into the loop of a closed loop control system. For a human-in-the-loop interactive robotic system, a motion capturing system serves as the primary technology to digitally record human body movements. In the context of head/neck motion capture, we propose a novel computer vision algorithm based on a low cost, low resolution, low bit rate camera, which offers several advantages over wearable devices. The proposed computer vision method captures the real time movements of a person's head and maps them to the TEBoT's mechanical assembly.
Related work
Interactive robots
In the last decades, a number of commercial and non-commercial interactive systems/robots have been developed. These robots can be broadly categorized into i) Non-Anthropomorphic Robots (N-APR) and ii) Anthropomorphic Robots (APR). The characteristics and appearance of N-APRs are not similar to those of real humans, but they still possess some socially interactive skills. The most common examples of N-APRs are mobile robotic telepresence (MRP) systems. The basic construction of almost all MRPs consists of a mobile robotic base, an LCD screen, a camera, a microphone and some non-verbal gestures, such as hand gestures. A comprehensive review of MRPs can be found in [9]. Other examples of N-APRs include Mebot [10], ESP [11], Jibo [12] and Keepon [13]. On the other hand, the characteristics and appearance of APRs are similar to those of real humans; for example, HRP-4C [14], Actroid DER3 [15], Geminoid [16] and Furhat [17] look so similar to a real human that it can be hard to decide whether one is a robot or a human. There also exists a set of APRs called androids, which have characteristics similar to humans but whose faces are not human-like, such as ASIMO [8], Telenoid [18] and iCub [19]. A detailed review of socially interactive N-APRs and APRs can be found in [20].
When it comes to video teleconferencing applications, almost all previously developed N-APRs and APRs have limitations. N-APRs do not present accurate nonverbal gestures, and APRs are complex, expensive and represent just one person. Furthermore, these systems are explicitly controlled by mouse, keyboard and other hand-held devices. Considering these limitations, we have designed and built a novel socially interactive system named TEBoT, which can present accurate head gestures along with audio-video communication. Research in psychology shows that, among all non-verbal modalities during human conversation, facial expressions and head movements are the most important for the flow of information [21]. TEBoT has the capabilities and appearance of an APR and the simplicity of an N-APR. TEBoT is portable, cheap and easily controllable, and it presents exact non-verbal head gestures. Furthermore, a novel single camera-based human-in-the-loop control strategy is devised for TEBoT control.
Human motion capturing approaches
To actuate a socially interactive bio-inspired robot, one of the best possible ways is to feed it real human motion [22]. Approaches to capturing human-body motion can be divided into two categories:
Sensor based approaches
In sensor based approaches (SBA), the motion capturing sensors (such as, accelerometers, gyroscopes, IMUs, GPSs, motion sensors, force sensors, electromagnetic trackers, ultrasound trackers and pressure sensors, etc.) are mounted to capture the human body movements. For example, wearable sensors attached to the human body can be used to capture human head movement [23], hand movement [24, 25], shoulder-arm movement [26], etc. Similarly, the full-body motion capture suits [27] are also being employed to record the human movements and later to control the robots from head to toe [28]. The details of motion capturing technologies can be found in [29].
Vision based approaches
In vision based approaches (VBA), the human body movements are captured by one or more cameras and/or depth sensors (such as MS Kinect). Recently, vision based techniques have gained popularity for estimating the pose angles of the human head [30, 31] and the human hand [32] using single, multiple or depth cameras. VBAs are also used to capture full body movements with a single camera [33], two cameras [34], multiple cameras [35] and a depth camera [36]. These body movements are then used to actuate the robotic system [37].
In the context of motion capture, the computer vision based approaches offer several advantages over the wearable sensor based approaches. In SBA, complex body-mounted sensors restrict the body movement in an environment and it is an overhead for a person wearing these sensors. In case of a non-wearable vision based system, the complex configuration and number of costly cameras require suitable laboratory settings for actuating robots. Hence, there is a need to build a system, which is easy to operate, portable and does not require costly and complex laboratory settings. In this work, we consider this important issue, i.e. how to precisely actuate the movement of biological system by using a non-contact, low cost sensor with high performance in terms of speed.
Contribution
The key contribution is in the design and control of a socially interactive head/neck robot, where real head motion dynamics are considered in the design of mechanical assembly. For motion control, we have proposed a novel vision based geometric head pose estimation technique to capture the human head movements and actuate the robotic platform in real time.
Head motion analysis
To analyze the characteristics of human head dynamics, we mount an inertial measurement unit (IMU), containing a 3-axis gyroscope and a 3-axis accelerometer, as shown in Fig. 1. The gyroscope measures the angular velocities around the z, y and x axes, denoted as $[\dot{\psi}(t), \dot{\theta}(t), \dot{\varphi}(t)]$. The accelerometer outputs the linear accelerations [a_z(t), a_y(t), a_x(t)] along the z, y and x axes. These measurements are used to estimate the angular positions using a Kalman filter. The angular positions are denoted by the vector of generalized coordinates [ψ(t), θ(t), φ(t)].
Procedure
For this work, we recruited seven participants aged 22 to 45 years. The participants were asked to record the angular motion [ψ(t), θ(t), φ(t)] and the angular velocities $[\dot{\psi}(t), \dot{\theta}(t), \dot{\varphi}(t)]$ using a head-mounted IMU. They performed five tasks with their head movements, each task more than ten times. These five tasks represent the actions of saying Yes, No and May-be, plus two motion patterns tracing a circle and a triangle:
Head Nod (Yes): Move head up and down, exciting mainly the pitch coordinate.
Head Shake (No): Move head left and right, exciting mainly the yaw coordinate.
Head Roll (May-be): Move ears close to shoulders, exciting mainly the roll coordinate.
Head Circle: Make circles by head movement.
Head Triangle: Make triangles by head movement.
Frequency analysis
The frequency analysis measures the operating frequency band of human head movements for all five tasks. The frequency spectrum for each task is estimated from the measured angular velocities by applying the fast Fourier transform (FFT) and/or power spectral density (PSD) estimation. The results of the frequency analysis for each task are presented in Table 1, which shows the minimum, maximum and mean frequencies of human head movements. Two of the five trajectories are visualized in Fig. 2, which shows the angular positions [ψ(t), θ(t), φ(t)], the angular velocities $[\dot{\psi}(t), \dot{\theta}(t), \dot{\varphi}(t)]$ and the frequency spectrum for the head nod and head circle movements. These frequency-analysis results are used in the design of the robot controller.
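As an illustration of this kind of spectral estimate, the following minimal sketch computes the amplitude spectrum of one angular-velocity trace and reports its dominant frequency. It is not the authors' processing script: the function name `dominant_frequency`, the 100 Hz sampling rate and the synthetic 1.2 Hz nod signal are all assumptions made for the example.

```python
import numpy as np

def dominant_frequency(omega, fs):
    """Return (freqs_hz, amplitude_spectrum, peak_hz) for an angular-velocity trace.

    omega: 1-D sequence of angular velocities in rad/s
    fs:    sampling rate in Hz (assumed value, not taken from the paper)
    """
    omega = np.asarray(omega, dtype=float)
    omega = omega - omega.mean()               # remove constant gyroscope bias
    window = np.hanning(omega.size)            # reduce spectral leakage
    spectrum = np.abs(np.fft.rfft(omega * window))
    freqs = np.fft.rfftfreq(omega.size, d=1.0 / fs)
    return freqs, spectrum, freqs[np.argmax(spectrum)]

# Synthetic stand-in for a recorded head nod: 10 s of a 1.2 Hz oscillation.
fs = 100.0
t = np.arange(1000) / fs
pitch_rate = 1.5 * np.cos(2.0 * np.pi * 1.2 * t)

freqs, spectrum, peak_hz = dominant_frequency(pitch_rate, fs)
peak_rad_s = 2.0 * np.pi * peak_hz             # same peak expressed in rad/s
print(f"dominant frequency: {peak_hz:.2f} Hz ({peak_rad_s:.2f} rad/s)")
```

Repeating the estimate over all repetitions of a task and taking the minimum, maximum and mean of the per-trial peaks would give summary values of the kind a frequency table reports.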
Velocity analysis and threshold calculation
The collected raw data are further post-processed to calculate other important attributes, such as kinematic constraints. These quantities give information about the range of head movement and the constraints on velocities. The results are shown in Table 2. This information is used to design our robot architecture. Additionally, the velocity constraints indicate the required specifications for the actuators.
Table 2 shows that the head nod, head shake and head roll movements also involve components of yaw, pitch and roll angles. Ideally, these movements should present only pitch movement for head nod, yaw movement for head shake and roll movement for head roll. These extraneous rotational components define our software threshold, which removes an undesired movement around zero velocities both in position estimation and control.
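One simple way to realize such a threshold is a dead-band that zeroes small values around zero. The sketch below is only an assumption about the form of the threshold; the function name `apply_deadband` and the 0.05 rad/s value are hypothetical and are not taken from Table 2.

```python
def apply_deadband(value, threshold):
    """Return 0.0 when |value| is below the threshold, otherwise the value itself."""
    return 0.0 if abs(value) < threshold else value

# Hypothetical roll-rate samples recorded during a head nod (rad/s):
# the small values are unintended cross-axis motion.
roll_rate = [0.01, -0.03, 0.02, 0.40, -0.35, 0.00]
threshold = 0.05  # assumed value for illustration only

cleaned = [apply_deadband(v, threshold) for v in roll_rate]
print(cleaned)
```

A dead-band keeps larger, intentional motion untouched, at the cost of ignoring genuine movements that are slower than the threshold.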
Biologically inspired design
Biological system: Human neck
The human neck has a complex anatomical structure formed by seven cervical vertebrae (see Fig. 3a) and around twenty main muscles (see Fig. 3b). The main role of the neck is to keep the human head in balance while performing different head movements. The neck muscles are responsible for performing head movements, and the combination of cervical vertebrae and muscles holds the head in an upright position. There are three essential head motions, i.e. yaw, pitch and roll (see Fig. 4); all other movements are combinations of these three in varying proportions. With reference to the cervical vertebrae, the neck muscles are divided into left-side and right-side groups, which form a parallel configuration acting at both sides of the shoulders. Working in pairs, the left and right neck muscles control the pitch movement. Working individually, these muscles control the yaw and roll movements (see [38] for more details).
Biological inspired design
To mimic the movements of a human neck, a system should satisfy not only the mobility properties of a human neck, but also its static and dynamic characteristics. The CAD model of our mechatronic system is shown in Fig. 5(a) and a working prototype in Fig. 5(c). Our design uses two active limbs (label 7 in Fig. 5(a)) and one passive limb (label 8 in Fig. 5(a)). The passive limb is a central rod connected between a base (label 3 in Fig. 5(a)) and a mounting assembly (label 4 in Fig. 5(a)) via a universal joint (label 6 in Fig. 5(a)), whereas the active limbs are connecting rods between the motors (label 2 in Fig. 5(a)) and the mounting assembly. The function of the passive limb is similar to that of the cervical vertebrae in the human neck, and the function of the active limbs resembles that of the human neck muscles. The passive limb, in combination with the two active limbs, supports the mounting assembly of a tablet PC (label 1 in Fig. 5(a)). The active limbs control the pitch and roll movements; the yaw motion is performed by a motor inside the base (see Fig. 5(b) (label D)). Similar to the human neck anatomy, our system is a 3DOF parallel kinematic system actuated by three servo motors assembled on the base.
The results of the head motion analysis are considered in the mechanical design of our robot. From Table 2, the maxima of the average peak angular values for yaw (ψ), pitch (θ) and roll (φ) movements are 1.43 rad, 0.642 rad and 0.65 rad, respectively. Considering these results, our mechanical design allows ±1.45 rad of yaw movement and ±0.76 rad of pitch and roll movement.
Motor selection
In upright equilibrium, the weight of the mounting assembly (label 4 in Fig. 5(a)) and tablet PC (label 1 in Fig. 5(a)) is supported by the passive limb (label 8 in Fig. 5(a)). The whole upper assembly is attached to the base through ball bearings (see Fig. 5(b)). Two parameters are considered important in the selection of motors: the torque in kg-cm (or N-m) and the speed in revolutions per minute, RPM (or rad/sec).
Torque for left and right servo motor
The total weight of the tablet PC (approx. 1 kg), the mounting assembly (0.5 kg) and the two active limbs (0.25 × 2 = 0.5 kg) is 2 kg. The length of the left and right servo arms is 4.5 cm (0.045 m). The combined torque required for the left and right servos is 2 kg × 4.5 cm = 9 kg-cm. The required torque for each individual servo motor is 9 kg-cm / 2 = 4.5 kg-cm.
Torque for base servo motor
The total weight of the upper assembly including a tablet PC, left and right servos is 2.4 kg. The length of base servo arm is 3.3 cm (0.033 m). The required torque for base servo motor is 2.4 kg×3.3 cm = 7.92 kg-cm.
Velocity requirements
The results in the previous section are used to define the required maximum speed for the motors. Table 2 shows that the maximum speed is 2.94 rad/sec (approximately 28.07 RPM).
To comply with the above specifications, we have selected a TowerPro SG-5010 standard servo, which provides a torque of 8 kg-cm and a speed of 58.8 RPM at 4.8 V, and a torque of 11 kg-cm and a speed of 71.4 RPM at 6 V.
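The sizing arithmetic above can be reproduced with a short script. This is a sketch for checking the numbers quoted in the text, with hypothetical function and variable names; it treats torque as mass times arm length in kg-cm, exactly as the text does, and ignores dynamic loads and safety factors.

```python
import math

def static_torque_kg_cm(mass_kg, arm_cm):
    """Static holding torque in kg-cm for a mass acting at the end of a servo arm."""
    return mass_kg * arm_cm

def rad_s_to_rpm(omega_rad_s):
    """Convert an angular speed from rad/s to revolutions per minute."""
    return omega_rad_s * 60.0 / (2.0 * math.pi)

# Values quoted in the text.
side_torque_total = static_torque_kg_cm(2.0, 4.5)   # tablet + mount + two limbs
side_torque_each = side_torque_total / 2.0          # shared by left and right servo
base_torque = static_torque_kg_cm(2.4, 3.3)         # whole upper assembly
required_rpm = rad_s_to_rpm(2.94)                   # peak head speed from Table 2

# Servo ratings quoted in the text: supply voltage -> (torque kg-cm, speed RPM).
servo_ratings = {4.8: (8.0, 58.8), 6.0: (11.0, 71.4)}
worst_case_torque = max(side_torque_each, base_torque)
servo_ok = {
    volts: torque >= worst_case_torque and rpm >= required_rpm
    for volts, (torque, rpm) in servo_ratings.items()
}

print(f"per-side torque: {side_torque_each:.2f} kg-cm")
print(f"base torque:     {base_torque:.2f} kg-cm")
print(f"required speed:  {required_rpm:.2f} RPM")
print(f"servo sufficient at each voltage: {servo_ok}")
```

With these figures the base servo at 4.8 V has only about 1% torque margin (8 kg-cm available against 7.92 kg-cm required), while at 6 V the margin is considerably larger.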
Controller selection
The servo motors function using PWM signals. To generate these PWM signals, a controller is required that can handle at least three PWM signals in parallel. Based on this requirement we have selected Arduino Uno, which operates at 5 V with a clock frequency of 16 MHz. It has 14 digital input/output pins, of which 6 can be used as a PWM output.
Model based design
Model based design (MBD) [39] is currently applied in a variety of industries. Just as computer aided design (CAD) provides a geometric way of describing a piece of equipment, MBD incorporates the dynamics and performance requirements to properly describe an overall system in a simulation environment.
From CAD to TEBoT simulation
In order to use a CAD model for MBD, the properties of the physical model have to be transformed into a corresponding set of differential equations describing the system dynamics and the equations of motion. With state-of-the-art control engineering design software, this can be done using certain add-ons in, e.g., SolidWorks. One example of this procedure is the use of SimMechanics [40], a product of MathWorks, which allows users to convert a CAD design model into a set of differential equations that represent the dynamics of a robotic motion. This set of differential equations can be simulated and further used for model based control design by applying other tools within the MathWorks product family.
Robot motion control
The task is to construct a controller for our TEBoT that meets the desired behavior of a closed loop control system. The dynamics of the yaw movement are simple and linear, so the yaw controller is a simple proportional controller. On the other hand, the pitch and roll dynamics of TEBoT are nonlinear and are given by:
To simplify the controller design for pitch and roll movement, we apply a linearization of the model using Taylor series [41].
The pitch and roll movements are controlled by PID controllers, where K_p denotes the values of the proportional gains, K_d the values of the derivative gains, and K_I the values of the integral gains. The gains of the multiple PID controllers are tuned automatically in Simulink by using the tune tab of the PID controller.
The main controller C_m maps the desired trajectories of the TEBoT to the corresponding servo trajectories using inverse kinematics. It starts with the desired rotational angles (θ_d and φ_d) of the TEBoT and calculates the servo rotations (q_l and q_r) required to achieve them. The gains of C_m for the given crossover frequencies between 1 and 11 rad/sec (taken from Table 1) are tuned with the help of the optimization methods described in [42] using MATLAB, with minimal loop interaction and adequate MIMO stability margins.
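For readers unfamiliar with the roles of the three gains, the following minimal sketch implements a textbook discrete-time PID law and closes the loop around a toy plant. It is not the Simulink controller used for TEBoT: the class `PID`, the gain values, the 10 ms step and the pure-integrator plant are all assumptions chosen for illustration.

```python
class PID:
    """Textbook discrete PID: u = Kp*e + Ki*sum(e*dt) + Kd*(e - e_prev)/dt."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, setpoint, measurement):
        error = setpoint - measurement
        self.integral += error * self.dt                   # integral term state
        derivative = (error - self.prev_error) / self.dt   # backward difference
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate(pid, setpoint, steps):
    """Toy plant: the command is treated as an angular rate and integrated."""
    angle = 0.0
    for _ in range(steps):
        rate_command = pid.update(setpoint, angle)
        angle += rate_command * pid.dt
    return angle


# Hypothetical gains; track a 0.5 rad pitch reference for 20 s.
final_angle = simulate(PID(kp=2.0, ki=0.5, kd=0.05, dt=0.01), setpoint=0.5, steps=2000)
print(f"angle after 20 s: {final_angle:.3f} rad")
```

In practice the gains would come from the automatic tuning step rather than being chosen by hand as they are here.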
This section describes a single camera based geometric head pose estimation technique. Our method uses the locations of facial features such as the eyes, mouth and nose tip to determine the pose from their relative configuration. We assume that the human head is a rigid object with three degrees of freedom, with yaw, pitch and roll angles denoted by ψ, θ and φ, respectively, as shown in Fig. 4. The following procedure is used for estimating these pose angles:
Face detection
The input to our geometric head pose estimation algorithm is a sequence of video frames containing i) the face of a person and ii) the undesired background. We first detect the human face in the cluttered video stream. A number of algorithms have been proposed for this purpose [43, 44]. We have employed the Haar-feature-based cascade classifiers proposed by Paul Viola and Michael Jones for human face detection [45]. This algorithm is a two-step process which consists of a training step and a testing step. In the training step, the algorithm learns to differentiate between a face image and the background. In the testing step, the algorithm uses the training information to detect a face in a video stream (see Fig. 7(a)).
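The cascade of [45] is built from Haar-like rectangle features evaluated on an integral image. The sketch below illustrates only that building block on a tiny synthetic image; it is not a face detector and does not include the trained cascade. The function names and the 4 × 4 test image are hypothetical.

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a leading row and column of zeros."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.astype(np.int64).cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, top, left, height, width):
    """Sum of pixels in a rectangle using four lookups in the integral image."""
    bottom, right = top + height, left + width
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]

def two_rect_feature(ii, top, left, height, width):
    """Two-rectangle Haar-like feature: left half minus right half of the window."""
    half = width // 2
    left_sum = rect_sum(ii, top, left, height, half)
    right_sum = rect_sum(ii, top, left + half, height, half)
    return left_sum - right_sum

# Synthetic 4x4 patch: bright left half, dark right half (a vertical edge).
patch = np.zeros((4, 4), dtype=np.uint8)
patch[:, :2] = 10

ii = integral_image(patch)
feature = two_rect_feature(ii, top=0, left=0, height=4, width=4)
print(f"feature response: {feature}")
```

Because every rectangle sum costs four table lookups regardless of its size, many such features can be evaluated per window at low cost, which is what makes cascade detection fast enough for video.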
Facial features detection
The second step is to find facial feature points in the detected face. Facial features include the eyes, nose, lips, mouth, eyebrows and facial boundary. Numerous methods exist for detecting human facial features, as presented in [46]. In this work, we have employed the well-known Constrained Local Model (CLM) approach [47]. The CLM is also a two-step process, which contains a training session and a testing session. In the training session, the shape and texture models are built from a large training set of labeled face images. The shape model includes the face and facial feature points, and the texture model includes the intensity values of the face and facial features. In the testing session, the algorithm iteratively estimates the facial feature locations in an unseen image by using the combined information of the shape and texture models. The result of CLM is shown in Fig. 7(b).
Estimating human neck’s reference frame
The human head has three independent movements (yaw, pitch and roll) around the Y, X and Z axes, as shown in Fig. 8. A video is formed by a sequence of consecutive images, and images from a standard camera contain only 2D information, i.e. X and Y coordinates. This information is sufficient to compute the roll angle of a human head. However, computing the yaw and pitch angles requires 3D information. Hence, this step consists of estimating the Z coordinates to define a reference frame for the human neck. This reference frame is assumed to be at C2 of the spinal column, as shown in Fig. 9, and it is found by using the locations of the eyes and the facial boundaries [48]. These features allow us to compute the width w, the height h, and the distance d between the centers of the eyes from the detected face, see Fig. 9. Given these quantities, the neck reference frame is given by:
This 3D reference point is used to estimate the yaw and pitch angles of the human head.
In the case of the roll angle φ, it is sufficient to know the positions of the center of each eye, i.e. E_l = (X_l, Y_l) for the left eye and E_r = (X_r, Y_r) for the right eye. Therefore, φ is computed from the right-angled triangle shown in Fig. 10(b):
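A plausible form of this right-triangle relation is φ = arctan((Y_r − Y_l) / (X_r − X_l)), i.e. the inclination of the line joining the two eye centers. The minimal sketch below evaluates it with `atan2`; the function name `roll_angle`, the pixel coordinates and the sign convention (left eye at the smaller X in the image) are assumptions, not details confirmed by the text.

```python
import math

def roll_angle(left_eye, right_eye):
    """Roll angle in radians from the 2D eye centers (X, Y) in image coordinates.

    Assumes the left eye appears at the smaller X coordinate, so a level
    head gives a roll of zero.
    """
    (x_l, y_l), (x_r, y_r) = left_eye, right_eye
    return math.atan2(y_r - y_l, x_r - x_l)

# Hypothetical eye centers in pixels.
level = roll_angle((100.0, 120.0), (160.0, 120.0))
tilted = roll_angle((100.0, 120.0), (160.0, 180.0))
print(f"level head:  {level:.3f} rad")
print(f"tilted head: {tilted:.3f} rad")
```

Using `atan2` instead of a plain arctangent of the ratio avoids a division by zero when the eyes are vertically aligned.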
For estimating the yaw ψ and pitch θ angles, we define the vector from O_n to O_e, where O_e = (X_e, Y_e, Z_e) is the midpoint between the centers of the eyes, see Fig. 10(a).
This vector is projected onto the XZ and YZ planes. The angles between these projections and the Z-axis give the yaw ψ and pitch θ angles of the human head, as computed below:
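The geometric idea can be sketched as follows, under the assumption that yaw is read from the projection onto the XZ plane and pitch from the projection onto the YZ plane, each measured against the Z axis. The function name `yaw_pitch` and the example coordinates are hypothetical, and the sign conventions may differ from those of the original equations.

```python
import math

def yaw_pitch(neck_origin, eye_midpoint):
    """Yaw and pitch in radians of the vector from the neck origin to the eye midpoint.

    Both points are (X, Y, Z) tuples expressed in the same reference frame.
    """
    dx, dy, dz = (e - n for e, n in zip(eye_midpoint, neck_origin))
    yaw = math.atan2(dx, dz)     # angle of the XZ-plane projection from the Z axis
    pitch = math.atan2(dy, dz)   # angle of the YZ-plane projection from the Z axis
    return yaw, pitch

# Hypothetical coordinates: head facing straight ahead, then turned sideways.
straight = yaw_pitch((0.0, 0.0, 0.0), (0.0, 0.0, 1.0))
turned = yaw_pitch((0.0, 0.0, 0.0), (1.0, 0.0, 1.0))
print(f"straight: yaw={straight[0]:.3f}, pitch={straight[1]:.3f}")
print(f"turned:   yaw={turned[0]:.3f}, pitch={turned[1]:.3f}")
```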
The output of our vision based geometric head pose estimation algorithm cannot be used directly because of several limitations at the software and hardware ends. At the software end, there are the limitations of video acquisition and camera parameters; at the hardware end, there are the limitations of the mechanical structure parameters. To map the complex head dynamics to a limited three degrees of freedom platform, we use a Kalman filter [49]. The Kalman filter uses a set of second order differential equations to predict the future state of the angular signals, thereby temporally shaping the geometric head pose angles into a suitable input for our mechatronic robot.
The Kalman filter uses a discrete time state space equation to govern the dynamic relation of these signals (ψ, θ, φ) between two successive time steps k − 1 and k:
The matrix A in Equation 9 is the state transition matrix and the vector B in Equation 9 is the control input model; they are given by the following equations:
The second step is the measurement update step, in which the angular values predicted in the state prediction step are corrected using the measured yaw, pitch and roll angles. The measured angular signals from the geometric head pose estimation algorithm are denoted by Z_k. The updated angular state is given by:
where K_k is the Kalman gain and C_k is a 1 × 2 vector given by:
At the end of the measurement update step, we update the covariance matrix for next step by using the following equation:
The output of the Kalman filter in each frame k is the updated state vector, which comprises three updated pose angles and three velocity estimates of the human head movement, i.e.:
An added advantage of the Kalman filter is that it also estimates the velocities of the human head, $[\dot{\psi}, \dot{\theta}, \dot{\varphi}]$, whereas the updated pose angles [ψ, θ, φ] are now a suitable input for our mechatronic robot.
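The prediction and measurement update steps described above can be sketched for a single pose angle as follows. The concrete matrices are assumptions: a constant-velocity model with state [angle, angular velocity], A = [[1, Δt], [0, 1]], B = [Δt²/2, Δt]ᵀ used to shape the process noise, and C = [1, 0] (a 1 × 2 vector, since only the angle is measured). The noise levels `q` and `r` and the function name `kalman_track` are hypothetical, and Δt = 1/25 s corresponds to a 25 Hz frame rate.

```python
import numpy as np

def kalman_track(measurements, dt, q=1.0, r=1e-4):
    """Filter one pose angle; returns an array of (angle, velocity) estimates.

    q: assumed variance of the unknown angular acceleration
    r: assumed variance of the vision-based angle measurement
    """
    A = np.array([[1.0, dt], [0.0, 1.0]])    # state transition
    B = np.array([[0.5 * dt ** 2], [dt]])    # maps acceleration into the state
    C = np.array([[1.0, 0.0]])               # only the angle is measured
    Q = q * (B @ B.T)                        # process noise covariance
    R = np.array([[r]])                      # measurement noise covariance

    x = np.array([[measurements[0]], [0.0]])  # initial state
    P = np.eye(2)                             # initial covariance
    estimates = []
    for z in measurements:
        # State prediction (no known control input is applied here).
        x = A @ x
        P = A @ P @ A.T + Q
        # Measurement update.
        S = C @ P @ C.T + R
        K = P @ C.T @ np.linalg.inv(S)        # Kalman gain
        x = x + K @ (np.array([[z]]) - C @ x)
        P = (np.eye(2) - K @ C) @ P
        estimates.append((x[0, 0], x[1, 0]))
    return np.array(estimates)

# Synthetic yaw sweep at 0.5 rad/s with measurement noise, sampled at 25 Hz.
dt = 1.0 / 25.0
rng = np.random.default_rng(0)
t = np.arange(200) * dt
noisy_yaw = 0.5 * t + rng.normal(0.0, 0.01, size=t.size)

filtered = kalman_track(noisy_yaw, dt)
print(f"final angle estimate:    {filtered[-1, 0]:.3f} rad")
print(f"final velocity estimate: {filtered[-1, 1]:.3f} rad/s")
```

Running three such filters, one each for yaw, pitch and roll, yields three smoothed angles and three velocity estimates per frame, matching the six outputs described in the text.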
Our system consists of four main blocks: i) a vision based algorithm (VBA) block, ii) a filtration block, iii) a model based design (MBD) block and iv) a real platform block, as shown in Fig. 11. The VBA for geometric head pose estimation is implemented in VC++. The MBD of TEBoT is simulated in the Simulink (MATLAB) environment. The VBA block and the MBD block communicate through internal TCP/IP. The inputs to the control algorithm of the MBD block are the yaw, pitch and roll angles from the filtration block. The control algorithm block implements a PID controller based on the error between the new pose angles and the previous pose angles. The servo controller takes its input from the PID controller and generates PWM signals for performing yaw, pitch and roll movements. These PWM signals can be used by the SimMechanics model to visualize the TEBoT response, and they can likewise be used for real time testing with the TEBoT hardware. The code generated automatically from the MBD is deployed on the Arduino to run the control algorithm during testing. For this testing, the communication between Simulink and the hardware is done through the USB port of a computer. SimMechanics provides the feedback that completes the closed loop control system.
Experimental results
The experiments are conducted to measure i) the accuracy of our vision based algorithm and ii) the tracking performance of the human-in-the-loop TEBoT system.
The accuracy of a head pose estimation algorithm can be measured in one of two ways: by evaluating it on annotated head pose data-sets, or by comparing it against ground truth recorded with tracking devices.
In the former case, the head pose data-sets are themselves usually annotated by the latter technique, i.e., by using expensive trackers. Furthermore, the accuracy of a less expensive IMU is comparable with that of expensive trackers, as shown by [52]. In this work, we have measured the accuracy of our geometric head pose estimation algorithm by employing the latter technique, using an Inertial Measurement Unit (IMU) as the ground truth.
An experiment is performed in which a user moves his head in different orientations while the data are logged at run-time. The logging frequency of both data streams (IMU and our algorithm) is 25 Hz, and for this experiment we have logged 4000 frames. The comparison of the IMU data with the geometric head pose algorithm data is presented as the mean error and standard deviation for each of the yaw, pitch and roll angles, as shown in Table 3. Some of the recorded frames for yaw, pitch and roll angles are also shown in Fig. 13(a, b, c).
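A comparison of this kind can be computed per angle as in the minimal sketch below. It assumes that the error is the absolute difference between the vision estimate and the IMU reading in each frame, which the text does not state explicitly; the function name `error_stats` and the sample values are hypothetical and do not reproduce Table 3.

```python
import statistics

def error_stats(estimated, ground_truth):
    """Mean and sample standard deviation of the per-frame absolute error."""
    errors = [abs(e - g) for e, g in zip(estimated, ground_truth)]
    return statistics.mean(errors), statistics.stdev(errors)

# Hypothetical yaw angles in degrees for three synchronized frames.
vision_yaw = [10.0, 21.0, 32.0]
imu_yaw = [10.0, 20.0, 30.0]

mean_err, std_err = error_stats(vision_yaw, imu_yaw)
print(f"yaw: mean error = {mean_err:.2f} deg, std = {std_err:.2f} deg")
```

Both streams must be time-aligned frame by frame before the differences are taken, which the common 25 Hz logging rate makes straightforward.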
To measure the performance of the overall system, we recorded the pose angles (yaw, pitch and roll) of ten different people with our geometric head pose algorithm. These raw pose angles are saved for further processing. Following Fig. 11, the pose angles are first filtered through the Kalman filter. The recorded trajectories are visualized in Fig. 14. The columns show the response of the Kalman filter on the yaw ψ, pitch θ and roll φ angles. The gray signal is one of the recorded sequences and the striped-black signal is the filtered response. The second row in Fig. 14 shows the velocities estimated by the Kalman filter.
Following Fig. 11, the filtered signals become the input to the MBD block. The MBD block implements the control algorithm for mimicking the head movement based on this input, and it includes the hardware in the loop (hardware block). The controller parameters are left as designed, i.e. the tuning made in the MBD block is not modified for the real experiment. The results are presented in the form of tracking performance, as shown in Fig. 15. The gray signal shows the filtered input signal and the striped-black signal shows the signal tracked by the MBD.
Conclusion
In this work, we have presented the design and development process of a human-in-the-loop socially interactive bio-inspired head/neck robot, in which single camera based motion capturing in combination with a Model Based Design (MBD) approach is used for mimicking human head movements. We have developed a reliable and robust biologically-inspired neck platform (TEBoT) using intuitive human-head motion analysis. The TEBoT is compact and self-contained, and it fulfills the static and dynamic performance of a human neck. In terms of modeling, we have presented a model based design (MBD) technique, for which we have transformed the physical CAD model of TEBoT into a set of differential equations describing the system dynamics and the equations of motion by using the SimMechanics library.
For designing the input control of the TEBoT, we have included human-head in the loop where real-human head provides an input to the TEBoT. To capture these real-human head movements we have considered the limitations of previously developed SBAs and VBAs and proposed a novel vision based technique which captures the pose angles of human head in real-time without using any wearable sensors and/or markers. Our proposed geometric head pose estimation algorithm calculates the pose angles based on facial feature points and the geometric manipulation of these feature points. Our novel input control is based on low cost, low resolution, low bit rate and non-wearable webcam of the computer.
Once all the sub-modules were available, we integrated them for real time visualization and testing. The real time visualizations were done in MATLAB/Simulink, which allows simulation studies, automatic tuning of control parameters, and code generation for hardware-in-the-loop testing. For real testing, the automatic code generation capability of MATLAB was used.
The experimental tests were done to i) measure the accuracy of our geometric head pose estimation algorithm and ii) measure the performance of the overall system in mimicking real head movements. The experimental results show the effectiveness of our geometric head pose estimation technique and satisfactory tracking performance for the overall system.
This article presents the idea of including a human in the loop and mapping real human head movements to our socially interactive bio-inspired head/neck robot by using a monocular camera. This idea can be extended to other body parts of a biological system, which is the aim of future work. Our proposed technique can be used in the field of learning-by-demonstration or imitation learning. In the future, we plan to use the TEBoT as an embodied agent for tele-operation, especially in video teleconferencing, for presenting the head gestures of a remote person. TEBoT can also be used for assisting elderly people, in distance learning scenarios and even in the entertainment industry.
Acknowledgments
The authors would like to thank Dr. Pedro La Hera and Dr. Daniel Ortiz Morales for their help in the controller design of the TEBoT.
