Abstract
Current eLearning systems enable streaming of live lectures to distant students, facilitating live instructor-student interaction. However, studies have shown that there is a marked divide between the experience of local students (students present at the instructor’s location) and that of distant students. One of the major factors contributing to this rift is the lack of gaze-aligned interaction. In this paper, we present a system architecture that receives gesture triggers as input and dynamically calculates the perspective angle of the speaking participant to be captured for the listener, facilitating eye contact. The gesture triggers are calculated using a Microsoft Kinect sensor, which extracts skeleton joint information of the instructor; gesture recognition is performed on the acquired joint information in real time. The recognized gestures serve as interaction-initiation triggers for dynamic perspective correction for gaze alignment during a conversation. For evaluation, we constructed a five classroom test-bed with dynamic perspective correction, and user study results indicate a marked 42% enhancement in experience with the gaze correction in place.
Introduction
When using current eLearning systems, distant students often feel a sense of disconnect from the learning environment as compared to the local students [1]. The quality of teacher-student interaction forms a key aspect in enhancing the learning experience, and remote students are often disadvantaged relative to local students in this regard. Local students often experience a “sense of community” and build a “sense of trust”, which is a major factor in enhancing the feeling of belonging in the community [2]. A major factor enhancing the quality of interaction is gaze alignment. In most eLearning systems to date, the instructor’s gaze does not meet the interacting student’s, leading to an uncomfortable engagement during a conversation [3].
We observed that gaze is by nature a very dynamic attribute that depends upon 1) the type of interaction and 2) the placement of objects across different locations. To illustrate interaction dependency, consider a scenario in a conventional classroom environment: “instructor I is talking to student S_x, and all other students S* are listening to their conversation”, as observed during a typical question and answer session. We observe that 1) I looks at S_x, 2) S_x looks at I and 3) S* looks at I. This, however, is not conserved in an eLearning scenario. The position of the camera capturing the video of a distant participant determines the perspective presented to the viewer; the perceived gaze direction of a participant on a video display can therefore be controlled by providing the correct perspective. This illustrates placement dependency.
In this paper, we present an architecture that tracks the instructor’s gestures to identify the type of interaction and dynamically adapts the perspective of the instructor given to the distant students, such that during a conversation with a distant student, both the instructor and the interacting distant student are provided with gaze-aligned perspectives of each other. To do this, we employ a Microsoft Kinect sensor with the OpenNI2 and NiTE2 application interfaces to track the instructor’s movements. Kinect’s depth sensor extracts joint information of the tracked person. The joint information is used to identify various gestures, such as pointing to a student. The instructor simply has to point at the display containing the student’s video, which generates an interaction initiation trigger. Once an interaction trigger is received and recognized, the system switches the perspective of the instructor by picking the appropriate camera from a set of cameras around the instructor, and streams the video from that camera to be displayed at the remote interacting student’s location. The perspective of the instructor is chosen such that the distant interacting student perceives eye contact. Upon reset (end of interaction), a central (best possible) viewpoint of the instructor is given to all distant students, making them feel that they are in the center of the learning environment.
As a result of implementation of the proposed technique, the classroom experience of the distant student gets transformed from a gaze insensitive experience as shown in Fig. 1(a) to a coherent gaze aligned experience as shown in Fig. 1(b).
This paper is organized as follows. Section 3 discusses prior art that constitutes the various modules in architecting gaze alignment systems. Section 4, the problem description, presents the anomalies in gaze directions in a typical eLearning setup. In Section 5, we discuss in detail the solution to the gaze misalignment problem. Section 6 describes a test-bed prototype used to test the effectiveness of the system. The evaluation, with a set of live participants across five distant locations, is then used to compare the effectiveness of the system with and without gaze correction.
The implementation of the proposed system was carried out using a four classroom test-bed. User study results indicate a marked 42% improvement in the quality of interaction as perceived by the distant students.
Related work
Work done by Janet Owens and Lesley Hardcastle [1] shows that remote students do not experience the same inclusive feeling of being part of the classroom as the locally present students (students present in the same location as the teacher). The remote students often feel a sense of isolation when it comes to interaction with the teacher or with their peers in the remote classroom.
Learners experience a “sense of community”, enjoy mutual interdependence, build a “sense of trust,” and have shared goals and values when having a face to face interaction in a conventional classroom setup [2].
Student-to-instructor interaction forms a very important paradigm in eLearning systems (Fulford & Zhang, 1993; Kumari, 2001; Sherry, 1996) [4].
Work done by Jung et al. [5] shows that the learning experience of remote students is significantly enhanced through social interaction with peers rather than with the teacher alone.
Monk and Gale [6], in the work “A look is worth 1000 words: Full gaze awareness in video-mediated conversations”, describe the concept of full gaze awareness in video-mediated interactions and identify “Mutual Gaze” and “Gaze Awareness” as key aspects of gaze alignment.
Cisco’s telepresence system [7] aims at enhancing eye gaze to make the interaction more immersive by using a plurality of cameras. The cameras of the system are arranged such that, when a local user within a local user section looks at a target on at least one remote display, the reproduced video image of that user has its eye gaze directed approximately at the corresponding target. However, this system offers a rigid solution, as the users’ relative locations are conserved across geographies, which becomes impractical in an eLearning scenario.
Work done by Romiszowski and Mason [2007] illustrates that although video-mediated communication enables a partially interactive immersive experience, it falls short in supporting the creativity and personal involvement of the participants in an online discussion [8]. Computer-mediated communication (CMC) often reduces the phatic functions that give reassurance to both the speaker and the listener [9]. Although CMC links fora having information with those that don’t, it has been observed that CMC often fails to ensure high-quality participation (Stephen, T. and Harrison, T.M., Comserve [1994]). Work done by Herb Thompson analyzes the advantages and disadvantages of CMC for learning [9].
In a large and dynamic interactive setting, such as a classroom environment, gaze directions are harder to preserve owing to simultaneous multiparty modes of interaction. Cisco’s Telepresence System [2010] aims at enhancing eye gaze to make the interaction more immersive by using a plurality of cameras. The cameras render a perspective such that participants appear to be seated around a table as they engage in video conferencing. However, this system offers a rigid solution, as the users’ relative locations are conserved across geographies, which becomes impractical in an eLearning scenario.
Work done by Ruigang Yang and Zhengyou Zhang [10] involves graphics hardware that synthesizes eye contact based on stereo analysis combined with rich domain knowledge to present a video that maintains eye contact. Such systems typically correct gaze misalignment only over small angles [10]. Similarly, work done by Jason Jerald and Mike Daily involves real-time tracking and warping of the eyes, applying machine learning algorithms to each frame, thus giving a feeling of natural eye contact between the interacting participants. Such solutions are applicable only to bipartite video conferencing [11].
Other technologies for gaze correction include the work done by Kuster, Popa and Bazin [12], which uses a Microsoft Kinect sensor to overlay the gaze-corrected portion of the face over the original image to obtain fairly artifact-free real-time video.
Problem description
To analyze the problem, consider a typical eLearning setup as shown in Fig. 2, with an instructor I present in location L0 and two distant students, S1 and S2, present in locations L1 and L2 respectively. Camera C_I captures the video of the instructor to be streamed to the distant locations. Likewise, cameras C_S1 and C_S2, in L1 and L2 respectively, capture the video of the distant participants S1 and S2 to be streamed to the instructor’s classroom. Displays D_S1 and D_S2 in L0 show the video of S1 and S2 for the instructor. The displays D_I in L1 and L2 show the video of the instructor for S1 and S2. The positioning of these objects is shown in Fig. 2.
Consider the scenario where the instructor is interacting with S1, as shown in Fig. 2.
It can be observed that, with the above setup, there is no eye contact between I and S1 in L1 during the interaction.
Interaction states
In this section, we look at how to synthesize a gaze-coherent system that dynamically adapts to interaction patterns. We observed two different interaction states, viz. the lecturing state and the question and answer state. The directions of the vectors are shown in Fig. 4.
Gesture triggered switching
As a classroom session moves between lecturing states and question and answer states, the instructor’s gestures, such as finger pointing, have to be monitored for interaction state change triggers. A Microsoft Kinect sensor with the OpenNI2 and NiTE2 application interfaces is employed for user skeleton tracking. Kinect outputs the skeleton joint information, with each joint, including the hand and elbow joints, given as a three-dimensional coordinate, as shown in Fig. 5. The joint information is processed to calculate the various poses for gestures.
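The switching between the two interaction states can be sketched as a small state machine. This is a minimal illustration under stated assumptions, not the implementation used in the system: the names `State`, `InteractionTracker`, `on_point` and `on_reset` are hypothetical, and the gesture recognizer is assumed to have already resolved a pointing gesture to the index of the classroom being addressed.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class State(Enum):
    LECTURING = "lecturing"
    QUESTION_ANSWER = "question_answer"


@dataclass
class InteractionTracker:
    """Tracks the current interaction state (hypothetical helper)."""
    state: State = State.LECTURING
    target: Optional[int] = None  # index k of the classroom being addressed

    def on_point(self, classroom: int) -> None:
        # A pointing gesture at the display of classroom k acts as the
        # interaction-initiation trigger (or retargets an ongoing one).
        self.state = State.QUESTION_ANSWER
        self.target = classroom

    def on_reset(self) -> None:
        # End of interaction: return to the lecturing state.
        self.state = State.LECTURING
        self.target = None
```

Keeping the target index alongside the state lets a later stage decide which camera feed each classroom should receive without re-running gesture recognition.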
To calculate the angle of the finger point, we use the elbow and hand joint information obtained from the Kinect as follows.
Hand position, Ph: (xh, yh)
Elbow position, Pe: (xe, ye)
Arbitrary point, P, on a line parallel to the x-axis through the elbow joint: (xe + Δ, ye), where Δ is a small distance.
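One way to obtain a pointing angle from these three points is to take the angle at the elbow between the reference segment from Pe to P and the forearm segment from Pe to Ph. The sketch below assumes this reading, since the exact formula is not given here; the function name `pointing_angle_deg` and the choice of a signed angle in degrees are illustrative assumptions.

```python
import math


def pointing_angle_deg(hand, elbow, delta=1.0):
    """Signed angle (degrees) at the elbow between Pe->P and Pe->Ph.

    hand  = (xh, yh), elbow = (xe, ye); P = (xe + delta, ye) with delta > 0.
    Positive angles are counter-clockwise from the x-axis direction.
    """
    if delta <= 0:
        raise ValueError("delta must be a positive distance")
    xh, yh = hand
    xe, ye = elbow
    # Forearm vector (elbow -> hand) and reference vector (elbow -> P).
    vx, vy = xh - xe, yh - ye
    ux, uy = delta, 0.0
    if vx == 0 and vy == 0:
        raise ValueError("hand and elbow coincide; direction is undefined")
    cross = ux * vy - uy * vx  # proportional to sin(angle)
    dot = ux * vx + uy * vy    # proportional to cos(angle)
    return math.degrees(math.atan2(cross, dot))
```

Because both the cross and dot products scale with Δ, the result does not depend on the particular value chosen for Δ, which is consistent with describing it only as a small distance.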
Gaze aligned system architecture
In this section, we describe a new architecture for developing a gaze-aligned system. Our architectural setup consists of multiple cameras and display units in the instructor’s classroom. Each of these devices is part of the network that connects the different participating locations. Each display unit is dedicated to displaying a set of distant students from one of the distant locations.
Each camera unit is positioned in line with the display unit facing the participant it is capturing, such that, during an interaction, when the instructor looks at the display showing the target students (the students with whom the instructor intends to interact), the target students see the frontal view of the instructor. The camera capturing the video of the distant students for the instructor is placed in line with the display showing the instructor in the students’ classroom. This provides a frontal viewpoint of the distant students for the instructor.
The figure below describes n participating classrooms. The 0th classroom is the instructor’s classroom, while the numbers 1 to n - 1 are given to the classrooms containing the other participating students. The 0th classroom and an arbitrary ith classroom (where i ranges from 1 to n - 1) are shown in the figure, along with the arrangement of cameras and displays.
When the instructor is in the lecturing mode, the instructor sees the video feeds of distant students on displays D_S1, D_S2, …, D_S(n-1), captured by cameras C_S1, C_S2, …, C_S(n-1). Camera C_I captures the video feed of the instructor, and it is mapped to all the displays D_I showing the instructor’s video.
When the instructor intends to interact with the kth set of participants, the instructor sees the video feed of S_k on display D_k, captured by camera C_k. S_k sees the video of the instructor captured by camera C_Tk on display D_Sk in L0. All distant students other than S_k see the video of the instructor captured by the camera C_(n/2) if n is even and C_((n+1)/2) if n is odd.
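The two feed mappings can be summarized as a lookup from each distant classroom to the camera whose view of the instructor it receives. The following is a minimal sketch under stated assumptions: the function names `central_camera` and `instructor_feeds` are hypothetical, and the string labels only mirror the notation of the text rather than any actual device identifiers.

```python
from typing import Dict, Optional


def central_camera(n: int) -> int:
    """Index of the central camera: n/2 if n is even, (n+1)/2 if n is odd."""
    if n < 2:
        raise ValueError("need the instructor's classroom plus one other")
    return n // 2 if n % 2 == 0 else (n + 1) // 2


def instructor_feeds(n: int, target: Optional[int] = None) -> Dict[int, str]:
    """Map each distant classroom i (1..n-1) to an instructor camera label.

    target=None models the lecturing mode; target=k models an interaction
    with the kth set of participants.
    """
    if target is not None and not 1 <= target <= n - 1:
        raise ValueError("target must be a distant classroom index")
    feeds = {}
    for i in range(1, n):
        if target is None:
            feeds[i] = "C_I"            # lecturing: one shared view
        elif i == target:
            feeds[i] = f"C_T{target}"   # gaze-aligned view for S_k
        else:
            feeds[i] = f"C_{central_camera(n)}"  # central view for the rest
    return feeds
```

For example, `instructor_feeds(4, target=2)` assigns the gaze-aligned camera to classroom 2 and the central camera to classrooms 1 and 3, while `instructor_feeds(4)` assigns `C_I` to all three.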
Evaluation
To evaluate the effectiveness of the system, we designed a four classroom test bed, and the participating students were subjected to two lecture sessions, one with and one without gaze correction. These students were then asked to rate both systems on a scale of 1 to 10 for the ease and naturalness of interaction with the instructor. Without the gaze correction system, the students rated the interaction with a mean μ = 4.5 and variance σ² = 1.673. With gaze correction, the students rated the interaction with a mean μ = 8.7 and variance σ² = 0.1944. Results indicate a 42% improvement in the feeling of presence during an interaction, corresponding to a gain of 4.2 points in the mean rating on the 10-point scale.
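The reported 42% figure matches the difference between the two mean ratings expressed as a fraction of the 10-point scale. The short sketch below works through that arithmetic; treating the improvement as a fraction of the scale is an assumed reading of the reported numbers, and the function name `rating_gain` is hypothetical. The function also returns the gain relative to the baseline mean, which is a different quantity and should not be confused with the reported figure.

```python
def rating_gain(mean_before, mean_after, scale_max=10.0):
    """Return (gain in points, gain as % of scale, gain as % of baseline)."""
    if mean_before <= 0 or scale_max <= 0:
        raise ValueError("baseline mean and scale maximum must be positive")
    points = mean_after - mean_before
    percent_of_scale = 100.0 * points / scale_max
    percent_of_baseline = 100.0 * points / mean_before
    return points, percent_of_scale, percent_of_baseline


# Means reported in the evaluation: 4.5 without and 8.7 with gaze correction.
points, pct_scale, pct_baseline = rating_gain(4.5, 8.7)
```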
Conclusion
Our implementation of the gaze alignment system with dynamic feed switching enabled the distant participants to have an engaging instructor-student interaction, thereby enhancing the feeling of belonging in the classroom environment. The system also incorporated refinements, such as transitional effects during video switching, that made the change in perspective less noticeable. This further enhanced the feeling of naturalness.
