On-line human activity recognition from audio and home automation sensors: Comparison of sequential and non-sequential models in realistic Smart Homes

Abstract

Automatic human Activity Recognition (AR) is an important process for the provision of context-aware services in smart spaces such as voice-controlled smart homes. This paper presents an on-line Activities of Daily Living (ADL) recognition method for automatic identification within homes in which multiple sensors, actuators and automation equipment coexist, including audio sensors. Three sequence-based models are presented and compared: a Hidden Markov Model (HMM), Conditional Random Fields (CRF) and a sequential Markov Logic Network (MLN). These methods have been tested in two real Smart Homes thanks to experiments involving more than 30 participants. Their results were compared to those of three non-sequential models: a Support Vector Machine (SVM), a Random Forest (RF) and a non-sequential MLN. This comparative study shows that CRF gave the best results for on-line activity recognition from non-visual, audio and home automation sensors.

Keywords

Activity recognition; Markov Logic Network; Statistical Relational Learning; Smart Home; Ambient Assisted Living

1. Introduction

Automatic human Activity Recognition (AR) is an important process for human behaviour monitoring but it is also extensively studied for the provision of context-aware services for smart objects (smart-phones, robots…) and smart spaces (smart homes, smart rooms, public spaces…) [20]. Smart Homes in particular have become a topic of increasing interest since they are a promising way to improve the daily life of people with loss of independence (elderly people or people with physical or cognitive disabilities) so that they always keep control over their lives and continue to live independently, to learn and to stay involved in social life. These technologies can also improve the life of the carers (who are often close relatives) by reducing the human and financial burden of such situations [30,57,65,71].

Many projects related to Smart Homes, such as Adaptive house [51], AwareHome [31], C@sa [22], Cirdo [9], Ger’-Home [88], MavHome [19], PlaceLab [35], or Sweet-Home [77], have been supported by national and international research foundations to address the challenges imposed by a growing elderly population. All of them have integrated human activity modelling and recognition in their systems.

Most of the progress made in the AR domain came from the computer vision domain [1]. However, the installation of video cameras in the user’s home not only raises ethical questions [72], but is also rejected by some of the targeted population [59].²

² As for any technology, video cameras can be very well accepted if the benefit is perceived to be higher than the feeling of intrusion.

Moreover, video processing is highly sensitive to light conditions which can dramatically vary in a home. Other approaches rely on information from RFID tags [32] and wearable devices [87]. In the first case, putting RFID tags on objects makes the maintenance of Smart Homes burdensome since any new object implies technical manipulations to fix the corresponding sensor and to configure it. The case of wearable sensors is sometimes not applicable when inhabitants do not want (or forget) to wear sensors all the time. Moreover, cost and dissemination of assistive technologies would be lower if they were built on standard home automation technologies with minimal technical additions. This is why there is an increasing interest in automatic human activity recognition from home automation sensors [13,14,55,63,70,72,84,85]. This type of environment imposes constraints on the sensors and the technology used for recognition. Indeed, information provided by the sensors for activity recognition is indirect (no worn sensors for localisation), heterogeneous (numerical or categorical, continuous or event based), noisy, and non-visual (no camera). This application setting calls for new methods for activity recognition which can deal with the poverty and unreliability of the provided information and can process streams of data. Moreover, these models should be checkable by humans and linked to domain knowledge.

In home automation sensor based AR, the problem has often been approached using off-line machine learning methods on pre-segmented activity intervals [6,26,54]. In that case the entire information (past, present, future) is considered to be accessible and the detection problem (i.e., detecting when an activity starts and ends) is ignored. While such an approach is valid for off-line analyses of human behaviour, many real-world applications need real-time or at least on-line AR. For instance, context-aware systems must know, at the time of the user’s interaction, which activities the user is performing. This task is more difficult than the off-line one as only present and past information can be used and classification must be provided within a reasonable time. Another issue is that the system must deal with activities that are not known a priori to avoid undesirable behaviours.

In this paper, we present an on-line activity recognition method for homes in which multiple sensors, actuators and home automation equipment coexist. This research is carried out as part of the Sweet-Home [77] project which aims at developing a complete framework to enable voice command in Smart Homes. In this framework, the interpretation of the commands and the decisions to be made depend on the context in which the interaction occurs. This context includes, among other information, the user’s current activity. For instance, if the user utters “Turn on the light”, the best action, if she is waking up in the middle of the night (respectively if she is dressing in the morning), could be to provide low intensity light using the bedside lamp (respectively high intensity light using the ceiling lamp). To perform on-line AR in Smart Homes from audio and home automation sensors, a framework was developed to summarise the stream of data into temporal windows and classify each window into one known class, or into a specifically defined Unknown class. This research brings the following contributions:

The integration of audio signals with home automation sensors for AR is an understudied area. This work not only demonstrates the interest of such fusion but also brings the first complete datasets for AR that contain home automation data as well as audio signals with multiple users. These datasets are available to the community [27,80]. Some of them were acquired during experiments in a realistic smart home involving elderly and visually impaired people [76].

The framework for on-line AR makes it possible to summarise asynchronous as well as continuous sampled signals into temporal windows.

The paper introduces a recent model for AR – Markov Logic Network – in both sequential and non-sequential versions. Moreover three sequential and two other non-sequential models for the AR task were tested and compared.

These models were evaluated on the above-mentioned datasets in a realistic way since windows of unknown class are fed to the classifiers. Indeed, in real-world settings not all possible activities can be learned, so applications must be able to handle unforeseen situations. Moreover, to avoid overfitting, a cross-validation technique was designed so that the records of the participants used for testing are excluded from the learning set.

The paper is organised as follows. After a short description of the AR classification techniques in Section 2, the framework for on-line AR is detailed in Section 3. In particular, this section introduces three sequence-based models, namely Hidden Markov Models (HMMs), Conditional Random Fields (CRFs) and finally Markov Logic Networks (MLNs), a statistical relational method that combines high expressibility (first order logic) with the handling of uncertainty. The methods were tested in two experiments performed in two real Smart Homes involving more than 30 participants. Moreover, their results were compared to those of state-of-the-art non-sequential models such as Support Vector Machines (SVM) and Random Forests (RF). These experiments and the corresponding results are described in Section 4. The paper ends with a discussion in Section 5 and a short conclusion in Section 6.

2. Related work

In the literature, Activity Recognition (AR) has been defined differently according to the level of granularity under consideration. In some works, for instance, a movement such as standing up, running or walking is considered as an activity [39]. As the activity to be recognized depends on the movement of the body, worn sensors are often used. This can be found in research concerning medical assessment [45] or daily activity interpretation [58]. Some other works consider the variation of a certain task: making tea, coffee, or preparing a meal [56]. In such cases, each activity is a specialization of a general task, and frequently the accuracy of the recognition is related to the number and type of the applied sensors since some subtask can only be recognized by the use of a particular sensor. In some applications of surveillance in public places, activities are considered as interactions among people. For instance, complex activities such as fighting and stealing are identified by means of video recognition techniques [3,47,67].

Besides the level of granularity of the activity, the way to perform the recognition can be divided into off-line and on-line. The former case consists of the analysis of a static set of data [6]. For example, when assessing the health state of a patient in a hospital, the sensor data of a previous time-span can be used to recognize the corresponding activities or to identify a change of behaviour. The advantage of such an analysis is that all temporal relations can be exploited allowing better accuracy since for every instance past and future events are available. In on-line recognition [34,40], the case we focus on, the analysis is done from a data stream while the subject is performing the activity. In this case the aim is to identify as quickly as possible the current activity at a certain instant relying only on past and present information.

Approaches for activity modeling can be divided mainly into two categories: knowledge-driven and data-driven. In the former category, a logic-based approach offers an ideal framework to model explicit knowledge which can be provided by an expert of the domain. Ontologies have been widely used for AR [16] since they provide readability and formal definitions while the inference can be performed by an ontology reasoner as a problem of satisfiability. Moreover, under a description-based approach, logic rules facilitate the implementation of expert knowledge within a model [70]. For instance, Augusto and Nugent [5] used logical models to represent the temporal relations among events to recognise activities. In Artikis et al. [3], Event Calculus (EC) has been used for AR because of its capacity to model complex activity and temporal relations. EC has also been used for behaviour reasoning by Chen et al. [15] in a framework aiming at assisting a person in a smart environment. Though logical approaches are highly expressive, they do not handle uncertainty whereas input data in smart home are highly noisy.

Data-driven approaches can be either unsupervised or supervised. Unsupervised activity recognition is pertinent when it is not required to recognize specific activities; for instance, in applications intended to recognize a change in the daily pattern of the inhabitant. Some relevant works [18,49] have studied methods to discover recurrent patterns, or motifs, from a stream of sensor data; other approaches consider the segmentation and clustering of the data in order to create models that can subsequently label a segment in one of the clusters [24,62].

In the case of supervised learning methods, the AR model is learnt by means of an annotated corpus. In most cases the training corpus is exploited in order to find the best parameters of the model. However the structure of the model can also be inferred automatically, for instance, by the induction of logical rules [4]. Many works have applied statistical methods in order to classify sets of sensor data produced over a short time interval as belonging to a particular activity [11,26]. As information in pervasive environments is uncertain in most cases, probabilistic approaches are suitable candidates to be applied for AR, although they assume a probabilistic independence between consecutive time intervals, which is often a false assumption. One of the most applied methods to include temporal relations in the model is dynamic Bayesian networks [84,86]. Activity recognition has also been treated like a problem of sequence labeling: to label a segment of sensor data into the most probable activity performed. Thus, modeling activities by Hidden Markov Models (HMMs) is extensive [23,52,83]. For instance, Duong et al. [23] extended a conventional HMM to model the duration of an activity, and Naeem et al. [52] defined activities as a composition of tasks modeled by hierarchical HMMs. During recent years, conditional random fields (CRFs) [42] have also been widely applied to AR. In particular, Chieu et al. [17] presented an application of CRFs for AR using physiological data. Nazerfard et al. [54] and Vail et al. [81] showed that CRFs can give better results than HMMs since they do not assume the probabilistic independence of the observation variables. Tong and Chen presented a method using Latent-Dynamic CRF for recognizing activities in smart homes [74].

Recently, Statistical Relational Learning (SRL) [29], a sub domain of machine learning, has gained much attention as it integrates elements of first order logic and probabilistic models. Under the SRL scheme, models are defined in a formal logical language that makes them reusable and easy to verify, that systematically takes uncertainty into account, and that allows easy inclusion of a priori knowledge. SRL has recently attracted attention in the domain of human activity modelling and recognition. For instance, Logic HMMs [38] and relational Markov networks [73] are both SRL methods that were considered for AR [33,46,53,58]. In our work, we applied Markov Logic Networks (MLN) [66], which become Markov networks when their predicates are grounded during the inference process. It is also possible to define a MLN which is equivalent to a dynamic model such as a linear CRF.

Some other researchers have carried out comparative studies on the application of machine learning methods [2,81]. However these works have focused mainly on the properties of the methods that make some of them more appropriate for AR than others. We consider it essential to extend these studies through the analysis of the inherent characteristics of the problems relative to this recognition task, such as the most influential sensor information for AR, or the importance of historical information in this specific task. Moreover, an analysis of state-of-the-art sequential methods compared to non-sequential methods for modelling historical information statically can shed light on the AR problem. Another original aspect of the present work with regard to the state of the art is that our evaluation is done under the assumption of on-line recognition, where future information is not available.

3. Method

Fig. 1.

Diagram of the overall methodology for activity model determination.

Our approach for on-line activity recognition from audio and home automation sensors is detailed in this section. In Smart Homes, AR can be performed from a set of very heterogeneous raw data streams of various sensors, such as binary presence detectors (Presence Infra-Red sensors or PIR), continuous microphone signals or temperature measurements. To handle this heterogeneity, the overall strategy we adopted is to summarise data from these sensors within temporal sliding windows to generate vectors of attributes that will feed into an activity classifier. This approach relies on the hypothesis that each instance of any activity is composed of a set of events whose observations are captured by the set of sensors. These observations are signatures of the activities and they can be described by statistics of predefined variables computed over temporal windows shorter than the minimal activity duration. Although activities captured in this manner might be large scale activities, we showed that they can provide sufficient contextual information to a home automation decision module [13].

The method to recognise activities from streams of raw sensor data goes through different levels of abstraction, as depicted in Fig. 1. The raw data are composed of symbolic timestamped values (from, e.g., infra-red sensors), state values (from e.g., switches), irregularly sampled signals (e.g., temperature) and equi-distantly sampled signals (from, e.g., microphones). Some of these data are pre-processed to extract higher-level information such as speech, non-speech sounds and the location of the inhabitant. This step is detailed in Section 3.1. Then, all the raw and inferred information is summarised as vectors of features $V_{n}$ , each of which corresponds to a temporal window $W_{n}$ of duration T. The feature vector comes together with an activity label $A_{n}$ that is generated from the ground truth by taking the activity having the longest duration within $W_{n}$ as the label. As the number of features can be very large, a feature selection step is performed to discard redundant and uninformative features, resulting in feature vectors $V_{n}^{'}$ . $V_{n}^{'}$ together with $A_{n}$ are used as input to the activity model learning schemes. All of the classification models are trained using supervised machine learning techniques described in Sections 3.3 to 3.6.
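The labelling rule described above (the activity with the longest duration within $W_{n}$ becomes the label $A_{n}$) can be sketched as follows. This is a minimal illustration in Python, assuming that the ground truth is available as (start, end, activity) tuples in seconds; the function name `window_label` and the fallback to "Unknown" when no annotation overlaps the window are assumptions made for the example, not details taken from the framework itself.

```python
from collections import defaultdict

def window_label(window, annotations, default="Unknown"):
    """Return the activity with the longest total overlap with the window.

    window: (start, end) in seconds, interpreted as ]start, end].
    annotations: iterable of (start, end, activity) ground-truth intervals.
    default: label returned when nothing overlaps the window (assumption).
    """
    w_start, w_end = window
    overlap = defaultdict(float)
    for a_start, a_end, activity in annotations:
        # Length of the intersection between the annotation and the window.
        shared = min(w_end, a_end) - max(w_start, a_start)
        if shared > 0:
            overlap[activity] += shared
    if not overlap:
        return default
    return max(overlap, key=overlap.get)

# Hypothetical annotations: 10 s of "eating" and 50 s of "cleaning" fall in ]30, 90].
annotations = [(0, 40, "eating"), (40, 200, "cleaning")]
print(window_label((30, 90), annotations))  # cleaning
```
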

This section summarises the pre-processing stage, and details the attributes and the classifier models.

3.1. Generating the vectors of attributes

The raw data captured within the Smart Home (see bottom of Fig. 1) are summarised by features computed over a temporal window. This section details the windowing strategy applied and the features computed.

3.1.1. Windowing strategy

In this paper, the aim is to build classification models for on-line processing. In on-line processing, only current and past information is available. This means that, for each current time t, the temporal windows W will cover the interval $] t - T, t]$ . For the sake of clarity, we will call $W_{1}$ the temporal window representing the interval $] 0, T]$ , $W_{2}$ the temporal window representing the interval $] T, 2 * T]$ , $W_{n}$ the temporal window representing the interval $] (n - 1) * T, n * T]$ , etc.

Given the dynamic nature of the activities, T must be chosen to be shorter than the minimal duration of an activity instance, but should be long enough to benefit from the past history. A problem with fixed-size windows is that an activity can be under-represented due to the window boundaries (e.g., an activity split evenly between two temporal windows). To solve this problem, consecutive windows may overlap by 0% to 50% of T. Given an overlap rate α between two consecutive windows, $\forall n > 1$ , $W_{n}$ covers the following interval: $\begin{matrix} ] (n - 1) * (1 - α) * T, (n * (1 - α) + α) * T] . \end{matrix}$

Finding the best values for T and α is a tedious task that requires testing each classification method on datasets partitioned with different combinations of T and α. In a previous work [11], we tested values for T of 60 and 120 seconds, with values for α of 0, 0.33 and 0.5 and we found that $T = 60 s$ and $α = 0.5$ gave the best results. Based on these results, all the experiments reported in this study were conducted using $T = 60 s$ and $α = 0.5$ .
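As a quick check of the interval formula of this section, the bounds of $W_{n}$ can be computed directly for the retained setting ($T = 60 s$, $α = 0.5$). The snippet below is a minimal sketch in Python; the function name `window_bounds` is hypothetical.

```python
def window_bounds(n, T=60.0, alpha=0.5):
    """Bounds (start, end) of the window W_n = ]start, end], for n >= 1.

    T is the window duration in seconds and alpha the overlap rate
    between two consecutive windows (between 0 and 0.5 in this study).
    """
    if n < 1:
        raise ValueError("window index n must be >= 1")
    start = (n - 1) * (1 - alpha) * T
    end = (n * (1 - alpha) + alpha) * T
    return start, end

for n in range(1, 4):
    print(n, window_bounds(n))
# 1 (0.0, 60.0)
# 2 (30.0, 90.0)
# 3 (60.0, 120.0)
```

With $α = 0$ the same function gives the non-overlapping windows $] (n - 1) * T, n * T]$.
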

3.1.2. Localisation and speech/non-speech sound detection

The raw data contains information that must be extracted to enhance activity recognition. Two types of information are considered: speech/non-speech sound event – which are important for activities of communication – and localisation of the inhabitant – which is of primary importance for activity recognition.

Speech/non-speech sound detection In this approach, sound events are detected in real-time by the AuditHIS system [78]. Briefly, the audio events are detected in real-time by an adaptive threshold algorithm based on a wavelet analysis and an SNR (Signal-to-Noise Ratio) estimation. The events are then classified into speech or everyday life sound by a Gaussian Mixture Model. The microphones being omnidirectional, a sound can be recorded simultaneously by multiple microphones; AuditHIS identifies these simultaneous events. For more details about the audio processing, the reader is referred to [78].

Localisation In Smart Homes, localisation can be performed using cheap infra-red sensors detecting human movements but these sensors can lack sensitivity. To improve this, our approach fuses information coming from different data sources, namely infra-red sensors, door contacts and microphones. The data fusion model is composed of a two-level dynamic network [12] whose nodes represent the different location hypotheses and whose edges represent the strength (i.e., certainty) of the relation between nodes. This method has demonstrated a correct localisation rate between 63% and 84% using uncertain data from several sensors.

3.1.3. Computed features

The traces generated from human activities are difficult to generalise, even in a given setting, due to the high inter and intra-person variability of realisations of a same task. This is why statistical attributes and inferred information were chosen to summarise the content of each window.

For all the binary sensors (e.g., infra-red motion detectors, switches), the number of firings in a time frame was computed. For all the contact-door sensors (e.g., doors, windows, furniture, curtains), the number of state changes was computed for each temporal window. For all events for which the duration is important (e.g., speech occurrences), the number of detections and their duration as a percentage of the temporal window were computed. For all signals (CO₂ level, temperature, humidity, brightness, water or electricity), the difference of mean value between time frames was computed. Regarding location, the percentage of time of occupation of each room was computed for each time window. Moreover, to add past information, the previous main occupied room is added as a feature. Finally, to account for the level of “activeness”³ of the person within the home, the numbers of events per temporal window for each of the categories (room, doors, electricity, water, non-speech and speech sounds) were summed up and divided by the frame duration.

³ In this paper, we distinguish the activity – i.e., the task being performed – from the activeness – i.e., the state of being active. It is also called ‘total agitation’ in the paper, from the French agitation, equivalent to bustle in English.
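A few of the features listed above can be illustrated with a short Python sketch. It assumes that events are available as lists of timestamps (in seconds) per sensor and that speech occurrences are given as non-overlapping (start, end) intervals; the sensor names and function names are hypothetical and only serve the example.

```python
def count_events(timestamps, window):
    """Number of events whose timestamp falls in ]start, end]."""
    start, end = window
    return sum(1 for t in timestamps if start < t <= end)

def duration_percentage(intervals, window):
    """Share of the window (in %) covered by the given (start, end) intervals.

    The intervals are assumed not to overlap each other.
    """
    start, end = window
    covered = sum(max(0.0, min(end, e) - max(start, s)) for s, e in intervals)
    return 100.0 * covered / (end - start)

def activeness(events_by_sensor, window):
    """Total number of events over all sensors divided by the window duration."""
    start, end = window
    total = sum(count_events(ts, window) for ts in events_by_sensor.values())
    return total / (end - start)

window = (0.0, 60.0)
events = {
    "pir_kitchen": [5.0, 12.0, 70.0],  # hypothetical motion detector firings
    "door_fridge": [8.0, 9.5],         # hypothetical door contact changes
}
speech = [(10.0, 16.0), (55.0, 65.0)]  # second occurrence is cut by the window

print(count_events(events["pir_kitchen"], window))  # 2
print(duration_percentage(speech, window))          # about 18.3 (11 s out of 60 s)
print(activeness(events, window))                   # 4 events / 60 s
```
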

Most of the activities under consideration in this study have a sequential pattern (e.g., sleeping implies going to the bedroom, then lying down on the bed and making few or no large movements; dressing implies getting clothes and making movements to put them on, etc.). However, in this windowing approach, most of the temporal information within the temporal window is lost. Nevertheless, given the high variability in the sequence of events for a simple activity even by the same person, we claim that such abstraction is a way to eliminate intra-class variations and noise in order to obtain a better generalization. Moreover, the duration of the windows being short, the hypothesis is that the sequential nature of the activities can be captured through sequence-based models. Another advantage of these features is their very low computational cost.

3.2. Known and unknown activities

The activities under consideration in the study are inspired by the well-known Activities of Daily Living introduced by Katz [37], which are often used in geriatric assessments (dressing, feeding, toileting, etc.). The chosen activities, slightly different in the two experiments, are detailed in Sections 4.2.1 and 4.2.2. They were chosen mainly to provide contextual information for decision making (e.g., for a voice-based home automation system [13]) but also to provide relevant information about the behaviour of the user.

Another class was also considered in the study: the Unknown class (also called the NULL hypothesis) that represents periods during which it is not currently known which activity is being performed (or labelled in the case of the training dataset). Indeed, as reported in other studies [27,40], a large number of the activities recorded in the user’s home are either transient activities not identifiable by the classifier (e.g., movement between rooms, wandering) or irrelevant activities. In our approach, we handle the Unknown class by considering it as a class label. Although acquiring a unique model of such a mixture of situations is of low interest for knowledge acquisition, its inclusion challenges the other learned classes and leads to more accurate learning of the “known” classes. It must be emphasized that, in further experiments, when the dataset is considered without the Unknown class, all the windows that are labelled as Unknown class are excluded from the training and testing set.

3.3. Activity modelling by Hidden Markov Model (HMM)

Hidden Markov Models (HMMs) [61] (Fig. 2) are extensively used in activity recognition, for which they have become a “standard” approach [23,52,83]. One use of HMMs in AR is to compute the most probable sequence of hidden states $Y = \{y_{1}, y_{2}, \dots, y_{n}\}$ given a sequence of observations $X = \{x_{1}, x_{2}, \dots, x_{n}\}$ . In this paper we focus on ergodic HMMs, that is, HMMs with fully connected hidden states (i.e., it is possible to reach any hidden state from any other hidden state).

Fig. 2.

Representation of a classical HMM for labelling elements in a sequence: $y_{i}$ are the hidden states, $x_{i}$ are the observations.

Ergodic HMMs are based on two assumptions. The first one, which is true for any HMM, is the conditional independence of the observations. An observation $x_{t}$ , emitted at time t, does not depend on any other observation, given the state that generated it. In many cases this assumption is false, but it still works well in practice. For example, if one observation is the location of the subject, this variable is not independent between consecutive states. Indeed, depending on the layout of the home, when we have a location at time t, the one at time $t + 1$ is in a restricted subset that depends on the first observation (the same location and the locations adjacent to it). The second assumption, which holds for first-order HMMs (such as the ergodic HMMs considered in this study), follows the Markov principle that the probability of the HMM being in state i at time t depends only on the state value at time $t - 1$ . That means that the activity performed within a certain temporal window is independent of all the other previous temporal windows except the preceding one. This is considered to be an acceptable assumption in our AR application.

To model activities, a separate model was trained for each activity. Each hidden state was modelled by a Gaussian Mixture Model. The learning process was consequently carried out for each of the different activities (eating, dressing, etc.). This consisted in estimating the initial probabilities, the parameters of the GMMs (using the EM algorithm), and probabilities of the observations for each state and the state transition matrix. Convergence to the final parameters was obtained via the Baum-Welch algorithm. Finally, models with 2 hidden states and a GMM with 3 Gaussians for each state were obtained.

AR was then performed by computing the log-likelihood of the input observations under each of the N activity models. Since the data are handled as a datastream, we do not try to determine the boundaries of each activity instance. Sequencing the datastream and adapting the models could be part of future work. During these tests, only the current and previous windows were considered. The HMM with the maximum likelihood was retained as the most probable class of the input sequence.
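The decision rule (one HMM per activity, the model with the maximum log-likelihood wins) can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: emissions are discrete symbols rather than the Gaussian mixtures used in this study, and all parameter values, symbol codes and names (`log_likelihood`, `classify`, `MODELS`) are made up for the example.

```python
import math

def log_likelihood(obs, start, trans, emit):
    """Log-likelihood of a non-empty sequence of discrete symbols under an HMM.

    start[i]    : probability of starting in hidden state i
    trans[i][j] : probability of moving from state i to state j
    emit[i][o]  : probability of emitting symbol o in state i
    Uses the scaled forward algorithm to avoid numerical underflow.
    """
    n_states = len(start)
    alpha = [start[i] * emit[i][obs[0]] for i in range(n_states)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Forward recursion: propagate through transitions, then emit.
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(n_states)) * emit[j][obs[t]]
                for j in range(n_states)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_lik

def classify(obs, models):
    """Return the activity whose HMM gives the highest log-likelihood."""
    return max(models, key=lambda name: log_likelihood(obs, *models[name]))

# Toy models with 2 hidden states; symbols: 0 = "bedroom", 1 = "kitchen".
MODELS = {
    "sleeping": ([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], [[0.9, 0.1], [0.8, 0.2]]),
    "eating":   ([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], [[0.1, 0.9], [0.2, 0.8]]),
}

# Current and previous windows only, as in the tests described above.
print(classify([0, 0], MODELS))  # sleeping
print(classify([1, 1], MODELS))  # eating
```
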

3.4. Activity modelling by Conditional Random Fields (CRFs)

Conditional Random Fields (CRFs) are graph based models to perform discriminative probabilistic inference over a set of variables in classification tasks [42].

Similarly to HMMs, CRFs can classify a sequence of variables $Y = \{y_{1}, \dots, y_{n}\}$ for a given sequence of observations $X = \{x_{1}, \dots, x_{n}\}$ . However, CRFs are not generative models, so they are not intended to model the joint distribution $p (X, Y)$ . Instead, CRFs are discriminative models: they model the conditional distribution $p (Y | X)$ without the requirement to model the distribution of the variable X. Graphically, a CRF is represented by an undirected graph, as shown in Fig. 3. In the case of activity recognition, we can consider X to be a set of vectors describing temporal windows, and Y to be hidden variables whose inferred values correspond to the most probable activities which generated the observations.

Fig. 3.

Representation of a CRF for labelling elements in a sequence.

Lafferty et al. [42] define a CRF as follows:

Let $G = (V, E)$ be a graph such that $Y = {(Y_{ν})}_{ν \in V}$ , so that Y is indexed by the vertices of G. Then $(X, Y)$ is a conditional random field if, when conditioned on X, the random variables $Y_{ν}$ obey the Markov property with respect to the graph, that is: $\begin{matrix} p (Y_{ν} ∣ X, Y_{ω}, ω \neq ν) = p (Y_{ν} ∣ X, Y_{ω}, ω \sim ν) \end{matrix}$ where $ω \sim ν$ means that ω and ν are neighbours in G.

Therefore, the probability of a node is conditioned on its neighbours and on the set of observations. CRFs are generally implemented as log-linear models by means of feature functions $f_{k}$ . In the case of linear conditional random fields, the simplified equation for estimating $p (Y ∣ X)$ is the following: $\begin{matrix} (1) & p (Y ∣ X) = \frac{1}{Z} \exp \left( \sum_{k = 1}^{K} λ_{k} f_{k} (y_{t}, y_{t - 1}, x_{t}) \right) \end{matrix}$ where $f_{k}$ are feature functions defined on subsets of Y and X, Z is a normalization factor, and $λ_{k}$ is a parameter to assign a weight to the feature function $f_{k}$ . These weights are estimated during the learning phase. The algorithm that was used is L-BFGS, a quasi-Newton method that approximates the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm.

When considering a model to label temporal windows as performed activities, having $y_{t}$ as the activity at time t and only one evidential variable $x_{t}$ representing the location of the user at time t, an example of feature function can be: $\begin{matrix} f_{k} (y_{t}, y_{t - 1}, x_{t}) = \begin{cases} 1 & \text{if } y_{t} = \text{“sleep”},\ y_{t - 1} = \text{“clean”},\ x_{t} = \text{“bedroom”} \\ 0 & \text{otherwise} \end{cases} \end{matrix}$ The feature functions $f_{k}$ take the value of 1 if their variables are set to the values specified in the function, 0 otherwise.

In our implementation of CRF for activity recognition the evidential variable $x_{t}$ is not a single value but a vector $V_{t}^{'} = x_{t} = \{x_{t}^{1}, \dots, x_{t}^{m}\}$ where each $x_{t}^{i}$ is the value of attribute i at time t. Thus, m feature functions $f_{1} (y_{t}, y_{t - 1}, y_{t - 2}, x_{t}^{1}, x_{t - 1}^{1}, x_{t - 2}^{1}), \dots, f_{m} (y_{t}, y_{t - 1}, y_{t - 2}, x_{t}^{m}, x_{t - 1}^{m}, x_{t - 2}^{m})$ , one for each attribute, were defined. These feature functions depend on the values of the current window and the two previous ones.
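The role of weighted binary feature functions can be illustrated with a toy linear-chain CRF in Python. This sketch is not the implementation used in this study: it uses a single location observation per window, first-order features $f_{k} (y_{t}, y_{t - 1}, x_{t})$ only, sums the weighted features over all positions t of the sequence as in a standard linear-chain CRF, and normalises by brute-force enumeration of all label sequences. The labels, feature functions and weights are made up for the example.

```python
import math
from itertools import product

LABELS = ["sleep", "clean"]

def f_sleep_after_clean_in_bedroom(y_t, y_prev, x_t):
    # The example feature function given in the text.
    return 1.0 if (y_t == "sleep" and y_prev == "clean" and x_t == "bedroom") else 0.0

def f_sleep_in_bedroom(y_t, y_prev, x_t):
    return 1.0 if (y_t == "sleep" and x_t == "bedroom") else 0.0

def f_clean_in_kitchen(y_t, y_prev, x_t):
    return 1.0 if (y_t == "clean" and x_t == "kitchen") else 0.0

# (feature function, weight lambda_k) pairs; the weights are arbitrary here,
# whereas in practice they are estimated during the learning phase.
FEATURES = [
    (f_sleep_after_clean_in_bedroom, 0.5),
    (f_sleep_in_bedroom, 1.5),
    (f_clean_in_kitchen, 2.0),
]

def score(ys, xs):
    """Sum over positions t and features k of lambda_k * f_k(y_t, y_{t-1}, x_t)."""
    total = 0.0
    for t in range(len(xs)):
        y_prev = ys[t - 1] if t > 0 else None
        for f, weight in FEATURES:
            total += weight * f(ys[t], y_prev, xs[t])
    return total

def conditional(ys, xs):
    """p(Y | X), with Z computed by enumerating every label sequence."""
    z = sum(math.exp(score(cand, xs)) for cand in product(LABELS, repeat=len(xs)))
    return math.exp(score(ys, xs)) / z

def most_probable(xs):
    """Label sequence with the highest score (hence the highest p(Y | X))."""
    return max(product(LABELS, repeat=len(xs)), key=lambda cand: score(cand, xs))

xs = ["kitchen", "bedroom"]
print(most_probable(xs))  # ('clean', 'sleep')
```

Enumeration is only feasible for very short sequences; practical implementations rely on dynamic programming instead.
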

3.5. Activity modelling by dynamic Markov Logic Network (MLN)

A Markov Logic Network (MLN) is a generative statistical relational model that combines First Order Logic (FOL) and probabilistic inference. An MLN model is expressive enough to include explicitly, by means of FOL, the main relations that exist among the elements of the smart environment involved in the activity recognition. In addition, every logical formula is given a numerical weight indicating a degree of truth. This logical representation, along with its set of weights, can be considered as a meta-model that, during the inference process, allows the construction of a Markov network, a purely probabilistic model that can deal with uncertain variables. In this section, we formally introduce the MLN model and our implementations for activity recognition.

An MLN is composed of a set of FOL formulae, each one associated with a weight that expresses a degree of truth. This approach softens the assumption that a logic formula can only be true or false. A formula f is grounded by substituting each variable in f with a constant. A grounded formula that consists of a single predicate is a ground atom. A set of ground atoms is a possible world. All possible worlds in an MLN are true with a certain probability which depends on the number of formulae satisfied and the weights of these formulae. Let F be a set of first-order logic formulae, with $w_{i} \in \mathbb{R}$ the weight of the formula $f_{i} \in F$ , and C a set of constants. During the inference process, each MLN predicate is grounded and the Markov network $M_{F, C}$ is constructed where each random variable is instantiated by a ground atom. An edge is created for every pair of variables representing predicates that appear in the same formula. The obtained Markov network allows the estimation of the probability of a possible world $P (X = x)$ by Eq. (2): $\begin{matrix} (2) & P (X = x) = \frac{1}{Z} \exp \left( \sum_{i} w_{i} n_{i} (x) \right) \end{matrix}$ where $Z = \sum_{x^{'} \in χ} \exp \left( \sum_{i} w_{i} n_{i} (x^{'}) \right)$ is a normalisation factor, χ the set of the possible worlds, and $n_{i} (x)$ is the number of true groundings of the ith clause in the possible world x. When the number of predicates and the size of the domain of the variables grow, exact inference in MLN becomes intractable, so Markov Chain Monte Carlo methods are applied to approximate $P (X = x)$ [66]. In our case, recognizing an activity consists of finding the activity a in A that maximises $P (X = a, e)$ , where e is the evidence represented by the values of the attributes in each window (e.g., the values of the $V^{'}$ vector). Learning an MLN consists of two independent tasks: weight learning and structure learning. Weight learning can be achieved by maximizing the likelihood with respect to a training set. 
If the ith formula is satisfied $n_{i} (x)$ times in x, then by using Eq. (2), the derivative of the log-likelihood with respect to the weight $w_{i}$ is given by Eq. (3): $\begin{matrix} (3) & \frac{\partial}{\partial w_{i}} \log P_{w} (X = x) = n_{i} (x) - \sum_{x^{'} \in χ} P_{w} (X = x^{'}) n_{i} (x^{'}) \end{matrix}$ where $x^{'}$ is a possible world in χ. The sum is thus performed over all the possible worlds $x^{'}$ and $P_{w} (X = x^{'})$ is $P (X = x^{'})$ computed using the vector $w = (w_{1}, \dots, w_{i}, \dots)$ . The maximisation of the likelihood is performed by an iterative process converging towards an optimal w. Unfortunately, evaluating Eq. (3) exactly is intractable in most cases, since the sum ranges over all the possible worlds. Thus, approximation methods are used in practice, such as the Scaled Conjugate Gradient method [48].
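To make Eqs. (2) and (3) concrete, the sketch below evaluates them by brute force on a toy set of four possible worlds, each summarised by its vector of true-grounding counts. The counts and weights are made up for illustration; real models have far too many worlds for this enumeration, which is why approximate methods are needed. The gradient is computed as the observed count minus the expected count under the model.

```python
import math

# Each possible world is summarised by (n_1(x), n_2(x)), the number of true
# groundings of two formulae. These counts are invented for illustration.
WORLDS = {"x_a": (2, 0), "x_b": (1, 1), "x_c": (0, 2), "x_d": (0, 0)}
WEIGHTS = (0.8, -0.3)  # hypothetical formula weights w_i


def unnormalised(counts, weights):
    """exp(sum_i w_i * n_i(x)) for one world."""
    return math.exp(sum(w * n for w, n in zip(weights, counts)))


def world_probabilities(worlds, weights):
    """Eq. (2): P(X = x) for every world, with Z summed over all worlds."""
    z = sum(unnormalised(c, weights) for c in worlds.values())
    return {name: unnormalised(c, weights) / z for name, c in worlds.items()}


def log_likelihood_gradient(observed, worlds, weights):
    """Eq. (3): n_i(x) minus the expected count of formula i under the model."""
    probs = world_probabilities(worlds, weights)
    grad = []
    for i in range(len(weights)):
        expected = sum(probs[name] * counts[i] for name, counts in worlds.items())
        grad.append(worlds[observed][i] - expected)
    return grad


probs = world_probabilities(WORLDS, WEIGHTS)
grad = log_likelihood_gradient("x_a", WORLDS, WEIGHTS)
```

A positive gradient component means that increasing the corresponding weight would make the observed world more likely.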

The implementation proposed for Activity Recognition uses a set of rules which model the relationship between each feature and the activity independently from the other features. Formally, if N discrete features are used for classification, the possible values for a feature i are given by the set ${Values}_{i} = \{V_{i, 1}, \dots, V_{i, | {Values}_{i} |}\}$ , and the activities considered are $Classes = \{A_{1}, \dots, A_{c}\}$ .

The rules used to classify activities have the following structure: ${feature}_{i} (W, V_{i, j}) \Rightarrow class (W, A_{k})$ , where the variable W is the temporal window to be classified. The following rules are examples of this pattern: $\begin{array}{ll} SoundsKitchen (W_{1}, I_{2}) & \Rightarrow class (W_{1}, Eating) \\ SoundsKitchen (W_{1}, I_{3}) & \Rightarrow class (W_{1}, Eating) \\ TotalAgitation (W_{1}, LOW) & \Rightarrow class (W_{1}, Eating) \\ TotalAgitation (W_{1}, HIGH) & \Rightarrow class (W_{1}, Eating) \end{array}$

The total number of rules in the model is given by $\sum_{i = 1}^{N} | {Values}_{i} | \cdot | Classes |$ . As two grounded predicates $class (W_{i}, A)$ and $class (W_{j}, A)$ , for $i \neq j$ , never appear in the same formula, in the resulting Markov network the probability of an activity being performed in a certain temporal window is independent of the other windows. This structure is then similar to a logistic regression model, as shown in Fig. 4. This model is called the Naive MLN in reference to the Naive Bayes network.
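The rule pattern and the rule count can be sketched in a few lines of Python. The two features, their discrete values and the three classes below are illustrative placeholders rather than the actual model, and the textual rule syntax is only meant to mirror the pattern above, not the input format of any particular MLN toolkit.

```python
from itertools import product

# Hypothetical discretised features and activity classes (illustrative only).
VALUES = {
    "SoundsKitchen": ["I1", "I2", "I3"],
    "TotalAgitation": ["LOW", "HIGH"],
}
CLASSES = ["Eating", "Sleeping", "Cleaning"]


def naive_mln_rules(values, classes):
    """One rule feature_i(w, V_ij) => class(w, A_k) per (feature, value, class)."""
    rules = []
    for feature, vals in values.items():
        for value, activity in product(vals, classes):
            rules.append(f"{feature}(w, {value}) => class(w, {activity})")
    return rules


rules = naive_mln_rules(VALUES, CLASSES)
# Closed-form count: sum_i |Values_i| * |Classes|
expected_count = sum(len(v) for v in VALUES.values()) * len(CLASSES)
```

Here the model contains (3 + 2) × 3 = 15 rules, each of which receives its own weight during learning.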

Fig. 4.

Ground Naive MLN.

In addition, we implemented a dynamic model that represents the activity recognition problem as a stochastic process in time. In this model, we use the same predicates, but the identifiers of the temporal windows become time arguments whose values are positive integers. We also introduce a temporal predicate $Previous$ that defines the order of the time arguments; for instance, $Previous (W_{2}, W_{3}), Previous (W_{3}, W_{4}), \neg Previous (W_{5}, W_{4})$ . The basic rules composing the Naive MLN remain the same with the addition of the following rules: $\begin{array}{l} Previous (W_{1}, W_{2}) \land class (W_{1}, A_{i}) \\ \Rightarrow class (W_{2}, A_{j}) \forall A_{i}, A_{j} \in Classes \end{array}$

The purpose of these rules is to establish a sequential relation between consecutive temporal windows. In the ground Markov network, two predicates $class (W_{i}, A)$ and $class (W_{j}, A)$ can belong to the same clique, i.e. there is a probabilistic dependency between them. During learning, the weights of these formulae are learned from the ground truth; during inference, the value of the previous class is provided by the preceding inference. This model is called the dynamic MLN or more simply MLN.

3.6. Activity modelling by non-sequential methods: SVMs and random forests

The last two methods considered in this study were Support Vector Machines (SVMs) and Random Forests. They are classification algorithms that were executed on each temporal window independently. These two algorithms have previously been used for activity recognition and have demonstrated good performance (e.g. [21,25,26]).

The principle of an SVM is to project an input vector into a feature space using a kernel (here we chose a Gaussian kernel, whose standard deviation σ, the first parameter of the model, must be determined beforehand). From the projected vectors, the learning algorithm determines the best possible separating hyperplane between the individuals of two classes, that is, the hyperplane whose distance to the closest points of each class, called the margin, is the largest. A second parameter, C, controls the trade-off between the size of this margin and the number of possibly misclassified training samples. This algorithm, originally developed by Vapnik et al. [8], has proved very effective on different kinds of classification tasks.
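For illustration, a Gaussian kernel can be written as follows. This is a sketch using the common parametrisation $\exp (- \| u - v \|^{2} / 2 σ^{2})$ ; the exact parametrisation used by the SVM implementation in the experiments is not stated here, so this form is an assumption.

```python
import math


def gaussian_kernel(u, v, sigma):
    """Gaussian (RBF) kernel between two equal-length numeric vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


# Two hypothetical attribute vectors at Euclidean distance 5.
k_narrow = gaussian_kernel([0.0, 0.0], [3.0, 4.0], sigma=1.0)
k_wide = gaussian_kernel([0.0, 0.0], [3.0, 4.0], sigma=5.0)
```

A small σ makes the kernel decay quickly with distance, so only very close training windows influence a decision, whereas a large σ produces smoother decision boundaries.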

A Random Forest (RF) [10] is an ensemble classifier composed of several decision trees. For a new input, each decision tree predicts a class and a voting strategy is used to determine, among the several trees, which class to attribute to the input vector. The induction of an RF combines random subspaces and bagging: each decision tree is constructed using a randomly selected, reduced number of attributes (the number of trees created is a parameter of the algorithm).
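The three ingredients mentioned above (bagging, random subspaces and voting) can be sketched as follows. The helper names are hypothetical and the tree induction itself is omitted. The attribute subset is drawn once per tree, as in the description above, whereas Random Forest implementations commonly redraw it at each node split.

```python
import random
from collections import Counter


def bootstrap_indices(n_samples, rng):
    """Bagging: draw n_samples training indices with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def random_subspace(attributes, k, rng):
    """Random subspace: pick k distinct attributes for one tree."""
    return rng.sample(attributes, k)


def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees.

    Ties are broken in favour of the class that was seen first.
    """
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so that the sketch is reproducible
sample = bootstrap_indices(10, rng)
subset = random_subspace(["PredominantRoom", "SoundsKitchen",
                          "TotalAgitation", "HotWaterTotal"], 2, rng)
decision = majority_vote(["Eating", "Sleeping", "Eating"])
```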

For more details about these well-known and well-documented models, the reader is referred to the previously cited papers. The determination of each of the parameters is described in Section 4.3.2.

4. Experiments and results

The methods were applied to data collected in two Smart Homes during two experiments involving 21 persons and 15 persons respectively. This section describes the Smart Homes (Section 4.1), the data sets that were acquired (Section 4.2) and the attribute selection and model parametrisation (Section 4.3). Finally, the results of the activity recognition are presented in Section 4.4.

4.1. Pervasive environments

4.1.1. The Domus Smart Home

The first pervasive environment considered is the Domus Smart Home that was designed by the Laboratoire d’Informatique de Grenoble (LIG) [28]. This flat was extensively used in the Sweet-Home project for experiments. Fig. 5 shows the details of the flat. It is a 30 m² flat including a bathroom, a kitchen, a bedroom and a study room, all equipped with sensors and actuators such as infra-red movement detectors, contact sensors and video cameras (used only for annotation purposes). In addition, seven microphones were set in the ceiling for audio analysis. The flat is fully usable and can accommodate a dweller for several days. The technical architecture of Domus is based on the KNX bus system (www.knx.org), a worldwide ISO standard (ISO/IEC 14543) for home and building control. Besides KNX, several field buses coexist in Domus, such as UPnP (Universal Plug and Play) for audio and video distribution or X2D for detecting the opening and closing of doors, windows and cupboards. More than 150 sensors, actuators and information providers are managed in the flat (e.g., lighting, shutters, security systems, energy management and heating). Sounds are recorded independently of the other sensor data using a National Instruments multichannel acquisition board and analyzed by the AuditHIS software [78].

The Domus flat was designed to resemble a standard flat as closely as possible, so that the participants moving in the smart home would behave naturally and perform activities in a manner as close as possible to their usual one.

Fig. 5.

The Domus Smart Home of the LIG.

4.1.2. The HIS Smart Home

The second Smart Home was set up inside the Faculty of Medicine of Grenoble by the researchers of the TIMC-IMAG laboratory. This 47 m² flat is composed of a bedroom, a living-room, a hall, a kitchen (with cupboards, fridge…) and a bathroom with a shower and a toilet. It was equipped with: (1) infra-red presence sensors (PIR), placed in each room to sense specific locations in the flat; (2) door contacts for detecting the use of some of the facilities (fridge, cupboard and chest of drawers); (3) microphones, also in each room, to monitor, record and process all the sounds inside the flat and classify them into sounds of daily living or speech; and (4) wide-angle webcams (for annotation purposes only).

All the sensors, their locations and the layout of the flat are presented in Fig. 6. The basis of the flat is the wireless presence infra-red (PIR) sensor, used in the AILISA project to monitor the level of activity of the person [44]. The other sensors (i.e. the microphones, webcams, environmental and contact sensors), which were added to the initial AILISA platform, are distributed across the four computers of the technical room so as to optimize both resource use and processing time. This room, next to the Health Smart Home (HIS), contains these computers and the electronic devices that receive and store, in real time, the information from the HIS. The computers are standard ones. The microphones require a National Instruments multichannel acquisition board for the analog-to-digital conversion of their signals; all the other connections use serial or USB ports.

Fig. 6.

Equipment and layout of the Health Smart Home (HIS) of the TIMC-IMAG Laboratory in Grenoble.

4.2. Experimental data

This section details the acquisition and the characteristics of each of the two datasets and the procedures followed to record them, before describing how they were used in this study. In all cases, cameras were used for video recording in each room for annotation purposes only, except in the toilet and bathroom, where there were no cameras in order to respect privacy.

4.2.1. The multimodal Sweet-Home (SH) dataset

The multimodal Sweet-Home dataset is part of the Sweet-Home corpus [80]; it is hereafter referred to as the SH dataset. In total, 21 persons (including 7 women) took part in the experiment, during which all sensor data were recorded in a daily living context. The average age of the participants was 38.5 ± 13 years (min 22, max 63), height 1.72 ± 0.13 m (min 1.43, max 2.10), and weight 70 ± 14 kg (min 50, max 105). Participants were asked to enter the flat and behave as if they were in their own home. Before this, the experimenter organised a detailed visit of the flat for each participant to make sure that the individual would not have to search for everyday items and would feel at home. Then, the individual was asked to perform each of the previously defined activities at least once. No time constraint was given to perform these activities. The list of activities the person was asked to perform is the following: (1) to close the door with the electrical lock, (2) to undress/dress, (3) to wash hands, (4) to prepare a meal and to eat, (5) to wash dishes, (6) to brush his/her teeth, (7) to have a nap, (8) to rearrange the bed, (9) to do the housework, (10) to read a book and to listen to music, (11) to have a phone call, (12) to go out of the flat for shopping and to come back, (13) to use the PC and to call a relative, (14) to undress, take a shower and go to bed to sleep.

In total, thanks to the 21 participants, more than 26 hours of data have been acquired with an average scenario duration of one hour. About 18 hours were kept for an activity recognition experiment while the remaining time was retained for specific audio analysis. Data were annotated with the 7 classes of activity using the Advene software.⁴

⁴ liris.cnrs.fr/advene.

This corpus is freely available at http://sweet-home-data.imag.fr/. For our study, this corpus was divided into two independent parts: the tuning set and the train-test set. For both datasets, the tuning set was used to train the localisation algorithms as well as to learn a discretisation model to provide discrete values to the MLN classifier. This tuning set was excluded from the train-test set to avoid overfitting. It is important to note that the tuning set was also used to tune the parameters of the classifiers.

4.2.2. The HIS corpus (HIS)

The HIS corpus [27] was acquired to monitor the activity of a person living alone at home, with the aim of helping geriatricians to evaluate the dependency level of elderly people [26]. Seven activities were selected to be classified automatically: Preparing and having a meal, Performing hygiene activities, Dressing and undressing, Sleeping or having a nap, Resting, Communicating with relatives on the phone, and finally the Elimination activity (being in the toilet and using it). Each person was asked to perform the different activities for as long as they wanted and as often as they wanted. They were told only which activities to perform, not the order in which to do them nor the way to perform them. The activities were performed by 15 healthy and non-elderly subjects (six women and nine men).

In total, about 13 hours of data have been acquired in the HIS flat. The average age of participants was 31.9 ± 9.0 years (min 24, max 57), the average height 1.74 ± 0.11 m (min 1.62, max 1.92), and the average weight 68.5 ± 9.11 kg (min 50, max 81). The mean execution time of each experiment was 51 min 40 s for a single participant (min 23 min 11 s, max 1 h 35 min 44 s). This corpus is freely available at http://getalp.imag.fr/HISData. Data were annotated using the same conventions as for the SH corpus. For our study, in the same way as for the previous corpus, it was divided into two independent parts: the tuning set and train-test set.

4.2.3. Implementation for evaluation purpose

From the raw data recorded by the different sensors of the flat, a vector V of attributes was extracted for each temporal window as described in Section 3.1. The duration T of the window W was set to 60 s. The attributes were inferred or computed from the different signals. This resulted in an attribute vector V of 94 elements in the case of the SH dataset and of 26 elements in the case of the HIS corpus.

Tables 1 and 2 present the distribution of the activity classes over 60-second windows for the SH corpus and the HIS corpus. The first column gives the activity classes, the second column shows the percentage of data that was set aside for preprocessing and tuning, the third column is the percentage of data used for training and testing the activity classification models and the last column presents the total. For SH, the preprocessing and tuning set was composed of the data from participants 8, 10, 11, 13, 14, 15 while for HIS it consisted of data from the last four participants.

Table 1
Distribution of time windows for each activity in each SH dataset part ( $T = 60 s$ )

Class	Tuning set	Train-Test set	Both sets
Cleaning	19.4%	20.6%	20.2%
Dressing/Undressing	2%	2.6%	2.4%
Eating	31.2%	27.2%	28.3%
Hygiene	6.5%	6.8%	6.7%
Phone	2.3%	2.9%	2.7%
Reading/Computer/Radio	23.5%	22.9%	23.1%
Sleeping	10.8%	12.8%	12.3%
Unknown	4.3%	4.2%	4.2%
Number of participants	7	14	21
Number of windows	603	1526	2129
Duration	5 h 01 m 30 s	12 h 43 m 00 s	17 h 44 m 30 s

Table 2

Distribution of time windows for each activity in each HIS dataset part ( $T = 60 s$ )

Class	Tuning set	Train-Test set	Both sets
Dressing/Undressing	3.2%	2.1%	2.4%
Eating	19.4%	15.7%	16.6%
Elimination	4.5%	4.6%	4.6%
Hygiene	3.5%	5.2%	4.8%
Phone	7.1%	6.1%	6.2%
Reading/Computer/Radio	22.1%	26.5%	25.5%
Sleeping	22.9%	21.2%	21.6%
Unknown	17.3%	18.6%	18.3%
Number of participants	4	11	15
Number of windows	376	1221	1597
Duration	3 h 08 m 00 s	10 h 10 m 30 s	13 h 18 m 30 s

The distribution of classes is unbalanced due to the natural differences in the duration of each daily activity. The distributions also differ between the two datasets because the participants recruited for each experiment played a different scenario in each of the two smart homes. The following section details attribute selection and model parametrisation.

4.3. Attribute selection and model parametrisation

This section details the pre-processing that has been applied to reduce the set of attributes and to tune the classification algorithms. The intervals that were not identified as one of the 7 specified activities were considered as belonging to the Unknown class.

4.3.1. Attribute selection

Performing attribute selection is a necessary step in data mining both to reduce the size of the data and to improve performance [50]. Moreover, for some of our algorithms, the number of features is large and its reduction is crucial for two reasons: (1) the time required for both training and testing can grow exponentially with the number of features and (2) the curse of dimensionality makes it difficult to interpret differences in distances in high-dimensional spaces.

Information Gain Ratio (IGR) was chosen for feature selection because it usually performs well in practice [50] and because it is independent of the classification model (in contrast with wrapper attribute selection methods [68]). IGR is the basic criterion of some decision tree algorithms (e.g., C4.5 [60]), which proceed by selecting the best attribute at each decision step from the remaining set of attributes. Recall that information gain is defined in terms of the entropy and of the probability of each value of the attribute under consideration. The entropy $H (V)$ of a variable V taking the values v is defined by: $\begin{matrix} H (V) = - \sum_{v} p (v) \cdot {log}_{2} (p (v)) \end{matrix}$ while the conditional entropy of V given a variable X (with its possible values x) is defined by: $\begin{matrix} H (V | X) = - \sum_{x} p (x) \cdot \sum_{v} p (v | x) \cdot {log}_{2} (p (v | x)) \end{matrix}$ The IGR for an attribute $a \in A$ , considering the class $c \in C$ , is then obtained as: $\begin{matrix} (4) & IGR (a, c) = \frac{H (c) - H (c | a)}{H (a)} \end{matrix}$

Formula (4) is applied to each attribute to obtain its score; a threshold can then be chosen to retain the k best attributes.
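The entropy definitions and formula (4) translate directly into a short Python sketch for discrete attributes. The toy attribute and class values below are invented for illustration, and returning 0 for a constant attribute (for which $H (a) = 0$ ) is a convention chosen here rather than something specified above.

```python
import math
from collections import Counter


def entropy(values):
    """H(V) = -sum_v p(v) * log2 p(v), with p estimated by relative frequency."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def conditional_entropy(targets, given):
    """H(V | X) = sum_x p(x) * H(V | X = x)."""
    n = len(targets)
    groups = {}
    for t, g in zip(targets, given):
        groups.setdefault(g, []).append(t)
    return sum((len(ts) / n) * entropy(ts) for ts in groups.values())


def gain_ratio(attribute, classes):
    """IGR(a, c) = (H(c) - H(c | a)) / H(a)."""
    h_a = entropy(attribute)
    if h_a == 0.0:
        return 0.0  # convention: a constant attribute carries no information
    return (entropy(classes) - conditional_entropy(classes, attribute)) / h_a


activity = ["Eating", "Eating", "Sleeping", "Sleeping"]
room = ["kitchen", "kitchen", "bedroom", "bedroom"]   # fully informative
noise = ["a", "b", "a", "b"]                          # uninformative
igr_room = gain_ratio(room, activity)
igr_noise = gain_ratio(noise, activity)
```

In this toy case the room attribute determines the class exactly and obtains the maximal score of 1, while the uninformative attribute obtains 0 and would be discarded by a non-zero IGR criterion.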

The computation of the IGR is done on the complete dataset across classes. This gain is determined for each attribute and each class and for each attribute a weighted mean is computed to obtain its final value.

At the end, only those features with non-zero IGR (features including some information) were retained.

Table 3
Attributes selected for the SH dataset using Information Gain Ratio for each attribute (66 attributes out of 94). The best 20 attributes are highlighted

Type	Attributes Names for GainRatio
Location	PercentageLocationRoom1, PercentageLocationRoom2, PercentageLocationRoom3, PercentageLocationRoom4, PredominantRoom, LastRoomBeforeWindow, NumberOfDetectionPIROffice, NumberOfDetectionPIRKitchen, TimeSinceInThisRoom, PercentageAgitationRooms
Switches	SwitchBathroomUse, SwitchBedroomBed, SwitchBedroom, SwitchOffice, SwitchSinkKitchen
Lights	PercentageTimeLightBathroomOn, ActivationDeactivationLightBathroomSink, ActivationDeactivationLightBedLeft, ActivationDeactivationLightBedRight, ActivationDeactivationLightKitchenSink, PercentageLightOfficeOn, PercentageLightKitchenSinkOn, PercentageLightKitchenTableOn, PercentageLightBedLeftOn, PercentageLightBedRightOn
Shutter	PercentageShutterBedroom, PercentageShutterBedroom2, PercentageShutterDesk2, PercentageShutterKitchen, ActivationDeactivationShutterBedroom, ActivationDeactivationShutterDesk, PercentageCurtain, PercentageShutterOffice
Power	PowerLastUse, PowerLastLastUse, PowerLastLastLastUse
Doors and Windows	ActivationDeactivationNumberOfDoorBedroom, ActivationDeactivationNumberOfDoorBathroom, ActivationDeactivationDoorCupboardKitchen, ActivationDeactivationDoorFridge, ActivationDeactivationNumberOfWindowBedroomBathroom, PercentageAgitationDoors
Sounds	SoundsKitchen, SoundsDinningRoom, SoundsBathroom, SoundsOfficeDoor, SoundsBedroomWindow, SoundsOfficeWindow, SoundsBedroomDoor, SpeechBedroomDoor, SpeechBedroomWindow, SpeechBathroom, SpeechKitchen, SpeechOfficeDoor, SpeechOfficeWindow, SpeechDinningRoom, PercentageAgitationSounds, PercentageAgitationSpeech, PercentageTimeSound, PercentageTimeSpeech
Miscellaneous	ColdWaterTotal, HotWaterTotal, TotalAgitation, AmbientSensorCO2Bedroom, AmbientSensorTemperatureOffice, AmbientSensorTemperatureBedroom
Class	One of: cleaning, dressing up, eating, hygiene, phone, sleeping, reading/computer/radio, unknown activity/transition

SH dataset The feature selection method was applied to the multimodal Sweet-Home corpus (SH dataset). The attribute vector V originally composed of 94 features was reduced to $V^{'}$ of 66 features by using IGR.

Table 3 shows the 66 obtained attributes. In that case, only the attributes that have a non-null information gain were kept. In this table, the 20 attributes having the highest IGR scores are highlighted. It suggests that among the selected attributes those that provide the best information to classify activities are the attributes related to the location of the inhabitant and the acoustic features.

HIS dataset Following the same method as for the SH dataset, the HIS corpus, with data vectors originally composed of 27 features (26 plus the class), was reduced to 24 attributes with IGR. For this dataset, the number of features originally available was small, which explains why only a few attributes were eliminated by the attribute selection process.

Table 4 sums up the reduced dataset.

Table 4

Attributes selected for HIS using retained non-zero Information Gain Ratio for each attribute (24 attributes out of 26)

Type	Attributes Names for GainRatio
Location	PredominantRoom, LastRoomBeforeWindow, PercentageLocation (in every room), TimeSinceInThisRoom
Doors	ActivationDeactivationCupboardDoor, ActivationDeactivationDressingDoor
Sounds	Sound on all the microphones
Speech	Speech in Entrance, Hall, Shower, WC, Kitchen
Class	One of: dressing up, eating, elimination, hygiene, phone, sleeping, reading/computer/radio, unknown activity/transition

4.3.2. Model tuning

HMM, SVM and random forest tuning A 10-fold cross-validation on each tuning set was performed to optimize several parameters of the classifiers. For the Random Forest, the number of trees was optimized; for the SVM, the pair $(C, σ)$ was searched using a grid search; and finally, for the HMM, the number of states and the number of Gaussians were determined. For the HMM, the optimization was done on the whole set of activities and not for each individual activity separately (for which an optimal topology of the HMM could perhaps improve the results). The parameter search was performed for each dataset (HIS and Sweet-Home) and the optimal values found for the parameters were kept for the classification.

CRF and MLN tuning The feature functions designed for the CRF model consider the evidential information of the current temporal window and also of the two previous ones. We found that using the two previous windows instead of only one slightly improves the accuracy of the algorithm while keeping an acceptable processing time.

In the cases of the MLN and the CRF, all the continuous numerical variables were discretised. A supervised method for discretisation, CAIM (Class-Attribute Interdependence Maximization) [41], has been run on the tuning set. It resulted in a set of discretisation intervals for each continuous attribute that were applied as a preprocessing stage to the input data of the CRF and the MLN. This algorithm works individually on each feature without the need to fix the number of discrete intervals as parameter. CAIM’s optimization goal is to maximize the class-attribute interdependence while minimizing the number of intervals. The number of intervals found in the datasets was always between 3 and 8. Once again, only the tuning set was used to avoid overfitting.
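As an illustration of the discretisation step, the sketch below maps a continuous value to an interval index given a set of cut points, and evaluates the CAIM criterion as it is commonly stated (the mean over intervals of $max_{r}^{2} / M_{r}$ , where $max_{r}$ is the count of the dominant class in interval r and $M_{r}$ the number of values in that interval). The data and cut points are invented, and the greedy search that CAIM uses to choose the cut points is not shown.

```python
import bisect
from collections import Counter


def discretise(value, cut_points):
    """Index of the interval containing value, given sorted inner cut points."""
    return bisect.bisect_right(cut_points, value)


def caim(values, labels, cut_points):
    """CAIM criterion for one attribute and one candidate set of cut points."""
    n_intervals = len(cut_points) + 1
    per_interval = [Counter() for _ in range(n_intervals)]
    for v, y in zip(values, labels):
        per_interval[discretise(v, cut_points)][y] += 1
    total = 0.0
    for counts in per_interval:
        m_r = sum(counts.values())
        if m_r:  # empty intervals contribute nothing
            total += max(counts.values()) ** 2 / m_r
    return total / n_intervals


# Hypothetical attribute values with their activity labels.
values = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
labels = ["Sleeping"] * 3 + ["Eating"] * 3
good = caim(values, labels, [0.5])    # separates the two classes
poor = caim(values, labels, [0.25])   # mixes them in one interval
```

The cut point that separates the classes cleanly obtains the higher criterion value, which is the behaviour the discretisation relies on.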

4.4. Results

Table 5
Overall accuracy (%) results on the two datasets with and without the Unknown class

Model	SH without Unknown	SH with Unknown	SH diff	HIS without Unknown	HIS with Unknown	HIS diff
SVM	75.00	71.90	3.10	74.86	64.90	9.96
Random Forest	82.96	80.14	2.82	70.72	62.32	8.40
MLN naive	79.20	76.73	2.47	75.45	66.81	8.64
HMM	74.76	72.45	2.31	77.26	67.11	10.15
CRF	85.43	83.57	1.86	75.85	69.29	6.56
MLN	82.22	78.11	4.11	75.95	65.82	10.13

4.4.1. Performance evaluation

The method used to evaluate the classifiers was a specific type of cross-validation, namely Leave-One-Subject-Out Cross-Validation (LOSOCV). If the dataset is composed of records⁵ from N participants, for each fold, records from $N - 1$ participants were used to train the model, while the remaining record was used for evaluating the learned model. Consequently, testing was performed on different individuals from training, and thus LOSOCV prevents participant overfitting.

⁵ Here ‘record’ means the full record for a single participant.
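The LOSOCV protocol can be sketched as a simple generator over participants. The participant identifiers and window names below are placeholders, and training and evaluation of the actual models are left out.

```python
def losocv_folds(records):
    """Yield (held_out, train_records, test_record) for each participant.

    `records` maps a participant identifier to that participant's full record.
    """
    for held_out in records:
        train = {p: r for p, r in records.items() if p != held_out}
        yield held_out, train, records[held_out]


# Four hypothetical participants with three windows each.
records = {f"P{i:02d}": [f"window_{i}_{j}" for j in range(3)] for i in range(1, 5)}
folds = list(losocv_folds(records))
```

With N participants this produces N folds, and no window of the held-out participant ever appears in the corresponding training set.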

Performance was assessed using the accuracy measure over the full dataset, defined as: $\begin{matrix} {Acc}_{Global} = \frac{\sum_{i} V_{i}}{\sum_{i} S_{i}} \end{matrix}$ where $V_{i}$ is the number of windows of class i correctly classified as i and $S_{i}$ is the total number of windows of class i. The average accuracy per class was also computed to assess the capacity of the learning method to model each class independently. This was defined as $\begin{matrix} {Acc}_{Class} = \frac{\sum_{i} {Acc}_{i}}{N_{c}} \end{matrix}$ where $N_{c}$ is the total number of classes and ${Acc}_{i} = \frac{V_{i}}{S_{i}}$ is the accuracy for the ith class.
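The two measures can be computed as in the following sketch. The labels are invented to show how the measures diverge on unbalanced data: a rare class that is poorly recognised barely affects the global accuracy but strongly lowers the average accuracy per class.

```python
from collections import Counter


def accuracies(y_true, y_pred):
    """Return (Acc_Global, Acc_Class) for two equal-length label sequences."""
    total = Counter(y_true)                                          # S_i
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)   # V_i
    acc_global = sum(correct.values()) / sum(total.values())
    acc_class = sum(correct[c] / total[c] for c in total) / len(total)
    return acc_global, acc_class


# Hypothetical ground truth and predictions for ten windows.
y_true = ["Eating"] * 8 + ["Phone"] * 2
y_pred = ["Eating"] * 8 + ["Eating", "Phone"]
acc_global, acc_class = accuracies(y_true, y_pred)
```

Here nine windows out of ten are correct, giving a global accuracy of 0.9, while the per-class accuracies of 1.0 and 0.5 average to 0.75.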

In all the results presented in Tables 6–9, the overall accuracy is given together with, in brackets, the mean accuracy and standard deviation computed over the participants.

4.4.2. Preprocessing performance

As presented in Section 3.1.2, two kinds of information were inferred from the raw data: location of the dweller and speech/non-speech sound events.

We adapted a dynamic network for multisource fusion with the aim of locating a participant in the smart home [12]. This process contains two levels: the first corresponds to generating location hypotheses from an event; and the second represents the context for which the activation indicates the most probable location given the previous events. Training was performed separately on the tuning sets of the SH and HIS datasets (cf. Section 4.2) and gave 84% correct location over the 1-second windows of the Train-Test set of SH and 96% with the HIS dataset. Thus, though the accuracy is acceptable for SH and excellent for HIS, the activity models are trained on imperfect data, which may affect learning.

As far as sound processing is concerned, the discrimination module was a Gaussian Mixture Model (GMM) which classified each audio event as either an everyday life sound or a speech sound. The discrimination module was trained with an everyday life sound corpus [36] and with the Normal/Distress speech corpus recorded in our laboratory [79]. Acoustic features were Linear-Frequency Cepstral Coefficients (LFCC) with 16 filter banks and the classifier was made of 24 Gaussian models. Acoustic features were computed for every frame using a size of 16 ms, with an overlap of 50%. On the HIS Train-Test set, the global accuracy of the speech discrimination was 84.61%. 25% of the sounds classified as speech were actually “non-speech sounds” and 13% of the sounds classified as non-speech were actually “speech-sounds”. So the classifier is again imperfect regarding speech/non-speech sound related features.

4.4.3. Global results

Table 5 shows the overall accuracy results for all the classification models and datasets both with and without including the Unknown class. Let’s recall that the case without the Unknown class means that these windows were excluded from the datasets both for the learning and testing stages.

It can be observed that the CRF approach has the highest accuracy in 3 out of 4 conditions, while the HMM approach shows the best accuracy for the HIS dataset without the Unknown class. MLN is always the second or third ranked method. The worst classifiers are the SVM under all conditions, the HMM on the SH dataset and the Random Forest on the HIS dataset (even though the latter was amongst the best on the SH dataset). For the SH dataset without the Unknown class, a Kruskal-Wallis test revealed a significant effect of the model on accuracy ( $\chi^{2} (5) = 16.22$ , $p = 0.006$ ). A post-hoc test using pairwise Wilcoxon rank-sum tests with Bonferroni correction showed that this effect is mostly driven by the difference between the CRF and the HMM ( $p = 0.032$ ). When the Unknown class is included, the significance increases ( $\chi^{2} (5) = 17.78$ , $p = 0.003$ ), still driven by the difference between the CRF and the HMM ( $p = 0.028$ ), with the difference between the CRF and the SVM just short of significance ( $p = 0.082$ ). None of the HIS results show a significant difference, probably due to the high variability between subjects.

When comparing the conditions without and with the Unknown class, it can be seen that the CRF has the smallest decrease in performance resulting from including the Unknown class for both datasets, while MLN, HMM and SVM show the biggest decreases. The difference in the size of these decreases between the two datasets can be explained by the high proportion of Unknown class windows in the HIS dataset (more than 18% of the total dataset) compared with the SH set (about 4%). Overall, CRF seems to be the method with the best performance in most of the conditions. In the remaining sections we focus on the CRF and other dynamic models (HMM, MLN) to study their behaviour in each condition.

Table 6
Classification accuracy using the SH dataset without Unknown class: overall (per participant record $\pm SD$ )

Class	HMM	CRF	MLN
Cleaning	64.8% (66.1 ± 18.9%)	82.80% (84.25 ± 12.93%)	75.16% (76.71 ± 12.39%)
Dressing/Undressing	2.7% (7.1 ± 26.7%)	53.84% (56.11 ± 31.88%)	30.77% (28.33 ± 32.4%)
Eating	76.9% (76.1 ± 28.5%)	85.43% (87.76 ± 16.49%)	83.37% (82.67 ± 18.52%)
Hygiene	79.1% (77.6 ± 25.7%)	79.80% (79.78 ± 23.84%)	78.85% (76.44 ± 26.4%)
Phone	54.8% (55.5 ± 33.8%)	50% (51.97 ± 37.72%)	81.82% (79.76 ± 23.6%)
Reading/Computer/Radio	92.1% (90.2 ± 13.1%)	91.14% (91.08 ± 8.67%)	91.71% (92.76 ± 7.15%)
Sleeping	72.4% (72.1 ± 25.8%)	90.81% (88.81 ± 13.29%)	86.22% (87.29 ± 12.62%)
Global	74.8%	85.43%	82.22%
Class	63.25%	76.26%	75.41%

Table 7

Classification accuracy using the SH dataset: overall (per participant record $\pm SD$ )

Class	HMM	CRF	MLN
Cleaning	70.8% (72.4 ± 13.9%)	79.82% (82.65 ± 14.22%)	81.85% (84.08 ± 14.84%)
Dressing/Undressing	2.7% (7.1 ± 26.7%)	50% (52.78 ± 26.66%)	25.64% (23.89 ± 28.67%)
Eating	77.4% (76.7 ± 23.9%)	83.57% (87.30 ± 15.86%)	77.35% (79.13 ± 22.65%)
Hygiene	68.1% (65.8 ± 25%)	87.88% (77.32 ± 25.29%)	78.85% (77.4 ± 21.61%)
Phone	50% (49.8 ± 35.8%)	54.34% (51.63 ± 38.74%)	79.55% (80.1 ± 23.56%)
Reading/Computer/Radio	92.1% (86.6 ± 19.8%)	91.66% (90.88 ± 7.86%)	87.71% (90.16 ± 10.34%)
Sleeping	66.7% (67.2 ± 26.6%)	89.62% (87.56 ± 13.87%)	84.69% (85.25 ± 14.4%)
Unknown	24.6% (23.2 ± 18.8%)	59.42% (50.37 ± 36.85%)	21.88% (18.96 ± 23.36%)
Global	72.45%	83.57%	78.11%
Class	56.55%	74.54%	67.19%

4.4.4. Results on the SH dataset

Detailed results per class both without and with the Unknown class are given in Tables 6 and 7. Without the Unknown class, CRF has the best accuracy both overall (85.43%) and averaged over classes (76.26%), closely followed by the MLN (82.22% globally and 75.41% per class), and both greatly outperform the HMM (74.8% globally and 63.25% per class). CRF shows the best accuracy for most classes (Cleaning, Dressing, Eating, Hygiene, Sleeping) while the MLN has particularly good results for Phone and Reading. Clear superiority of the CRF method is exhibited for Dressing (56.11 ± 31.88%) and Cleaning (84.25 ± 12.93%) while the MLN shows significant superiority in the Phone class (79.76 ± 23.6%). HMM shows good results on the Hygiene and Reading classes but is very poor on Dressing.

When the Unknown class is considered, the pattern remains the same. All the accuracy measures decrease except for both MLN and HMM in the case of the Cleaning class, where the results improved slightly and the MLN outperformed the CRF. The MLN again shows a significant superiority for the Phone class (80.1 ± 23.56%) over both the HMM (49.8 ± 35.8%) and the CRF (51.63 ± 38.74%). In all other cases, CRF demonstrates the best accuracy.

4.4.5. Results on HIS dataset

Detailed results per class both without and with the Unknown class being included are given in Tables 8 and 9. Without the Unknown class, the HMM has the best accuracy overall (77.3%) and averaged over class (71.0%), being slightly better than the MLN performance (75.95% globally and 68.99% per class) and that of the CRF (75.85% globally and 66.71% per class). The statistical tests did not reveal any significant difference between the models. Moreover, the highest performances for each class are well distributed over the methods, with HMM the best for Dressing, Sleeping and Elimination, with the CRF best for Eating and Reading and the MLN the best for Hygiene and Phone.

Table 8
Classification accuracy using the HIS dataset without Unknown class: overall (per participant record $\pm SD$ )

Class	HMM	CRF	MLN
Dressing/Undressing	46.2% (30.8 ± 40%)	38.46% (23.73 ± 39.97%)	30.77% (26.98 ± 39.55%)
Eating	90.6% (90.2 ± 17.7%)	95.31% (94.81 ± 9.5%)	93.23% (93.43 ± 12.86%)
Elimination	85.7% (65.5 ± 46.7%)	64.29% (46.61 ± 43.35%)	48.21% (33.72 ± 44.45%)
Hygiene	36.5% (35.6 ± 45.1%)	36.51% (41.44 ± 36.28%)	73.02% (52.27 ± 50.56%)
Phone	83.8% (76.6 ± 36.3%)	81.08% (72.09 ± 34.8%)	91.89% (83.93 ± 31.65%)
Reading/Computer/Radio	75.9% (73.8 ± 38%)	77.16% (75.4 ± 35.64%)	75.93% (68.96 ± 39.53%)
Sleeping	78.4% (61.4 ± 41.5%)	74.13% (64.77 ± 32.35%)	69.88% (65.13 ± 37.93%)
Global	77.3%	75.85%	75.95%
Class	71.0%	66.71%	68.99%

Table 9

Classification accuracy using the HIS dataset: overall (per participant record $\pm SD$ )

Class	HMM	CRF	MLN
Dressing/Undressing	26.9% (10.1 ± 18.8%)	15.38% (4.25 ± 9.74%)	26.92% (18.36 ± 32.73%)
Eating	88% (87.6 ± 20%)	89.58% (87.37 ± 20.25%)	85.94% (84.52 ± 19.05%)
Elimination	76.8% (61 ± 46.9%)	62.5% (42.07 ± 38.89%)	57.14% (47.69 ± 45.89%)
Hygiene	25.4% (32.3 ± 45.9%)	20.63% (26.36 ± 40.56%)	34.92% (34.55 ± 48.24%)
Phone	77% (69.8 ± 32.4%)	78.38% (69.68 ± 35.54%)	60.81% (55.65 ± 36.62%)
Reading/Computer/Radio	74.7% (72.6 ± 37.4%)	75.93% (73.04 ± 38.06%)	69.44% (61.46 ± 44.44%)
Sleeping	79.9% (68.1 ± 37.7%)	70.66% (65.44 ± 36.59%)	71.81% (64.2 ± 39.11%)
Unknown	32.9% (33.9 ± 7.8%)	59.47% (60.33 ± 14.72%)	52.86% (53.43 ± 14%)
Global	67.1%	69.29%	65.82%
Class	60.2%	59.07%	57.48%

When the Unknown class is considered, the pattern changes slightly. CRF gives the highest accuracy globally (69.29%) but not per class (59.07% vs 60.2% for HMM). The best method per class remains largely the same: the HMM is still the best for Dressing (tied with the MLN), Elimination and Sleeping; the CRF is the best for Eating, Phone, Reading and Unknown; and the MLN is the best for Hygiene (on a par with the HMM). Thus, the overall improvement of CRF is mostly driven by its good classification of the Unknown class, which represents 18% of the HIS dataset. Again, the HMM clearly exhibits the best performance for Elimination compared with the CRF and the MLN.

4.4.6. CRF performance in discriminating activities

Some classes were more difficult to discriminate than others. Tables 10 and 11⁶ present the confusion matrices for the CRF on the Sweet-Home and HIS datasets in the case where the Unknown class is included. Unsurprisingly, in both corpora, the Unknown class is very uniformly confused with other classes, with stronger consequences for the HIS corpus since instances of the Unknown class constitute a large part of the dataset. For the Sweet-Home dataset, Eating and Cleaning are confused with each other. It should be noted that these two activities were often performed in the same room. The Reading/Computer class exhibits a low specificity, with a lot of confusion with Phone, Dressing and Sleeping. This is not surprising since the Reading/Computer class is composed of different sub-classes which share common characteristics with the classes it gets confused with. For the HIS corpus, Elimination and Hygiene are confused with each other. Again, it should be noted that these two activities were performed in the same area. For the Reading/Computer class, a similar trend as for Sweet-Home is observed. This class shares many properties with other classes, notably Sleeping. Indeed, Reading consisted of sitting on the sofa and reading a magazine while the Sleeping activity was to lie on the bed doing nothing. These two activities were thus very quiet, involved very little motion and generated a very low amount of information. Moreover, they occurred in the same area (only an open partition separated the bed and the sofa). It is also worth noticing the very low sensitivity for Dressing. This activity was very short and performed between the living room and the bed area. The small number of examples of this class explains the low performance in learning it.

⁶ Sensitivity and specificity are also given in these tables. As a reminder, let TP be the number of True Positives, TN the number of True Negatives, FP the number of False Positives and FN the number of False Negatives. Then $Sensitivity = \frac{TP}{TP + FN}$ and $Specificity = \frac{TN}{TN + FP}$ .

Table 10

Confusion Matrix for CRF – SH dataset

	Cleaning	Dressing/Undressing	Eating	Hygiene	Phone	Reading/Computer	Sleeping	Unknown
Cleaning	255	2	39	1	0	4	1	6
Dressing/Undressing	0	20	1	1	0	3	2	2
Eating	48	1	364	8	0	1	0	6
Hygiene	3	3	3	80	0	1	2	4
Phone	1	0	0	0	23	3	0	0
Reading/Computer/Radio	4	9	1	9	19	319	14	8
Sleeping	1	2	6	4	1	16	175	2
Unknown	2	2	1	1	1	3	2	36
Sensitivity	81.21%	51.28%	87.71%	76.92%	52.27%	91.14%	89.29%	56.25%
Specificity	95.63%	99.40%	94.24%	98.88%	99.73%	94.56%	97.59%	99.18%
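The Sensitivity and Specificity rows of Table 10 can be reproduced from the counts with the short Python sketch below. It assumes that columns hold the true class and rows the predicted class; this orientation is not stated in the table and is inferred here because it is the one that matches the reported figures. The function name is hypothetical.

```python
LABELS = ["Cleaning", "Dressing/Undressing", "Eating", "Hygiene",
          "Phone", "Reading/Computer/Radio", "Sleeping", "Unknown"]

# Counts copied from Table 10 (ASSUMED layout: rows = predicted, columns = true).
CM = [
    [255, 2, 39, 1, 0, 4, 1, 6],
    [0, 20, 1, 1, 0, 3, 2, 2],
    [48, 1, 364, 8, 0, 1, 0, 6],
    [3, 3, 3, 80, 0, 1, 2, 4],
    [1, 0, 0, 0, 23, 3, 0, 0],
    [4, 9, 1, 9, 19, 319, 14, 8],
    [1, 2, 6, 4, 1, 16, 175, 2],
    [2, 2, 1, 1, 1, 3, 2, 36],
]


def sensitivity_specificity(cm, k):
    """Return (TP / (TP + FN), TN / (TN + FP)) for the class of index k."""
    total = sum(sum(row) for row in cm)
    tp = cm[k][k]
    fn = sum(row[k] for row in cm) - tp  # true class k, predicted as another class
    fp = sum(cm[k]) - tp                 # predicted k, true class is another one
    tn = total - tp - fn - fp
    return tp / (tp + fn), tn / (tn + fp)


for index, label in enumerate(LABELS):
    sens, spec = sensitivity_specificity(CM, index)
    print(f"{label}: sensitivity {100 * sens:.2f}%, specificity {100 * spec:.2f}%")
```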

Table 11

Confusion Matrix for CRF – HIS corpus

	Dressing/Undressing	Eating	Elimination	Hygiene	Phone	Reading/Computer	Sleeping	Unknown
Dressing/Undressing	4	0	0	0	3	1	7	8
Eating	1	172	0	0	0	3	2	23
Elimination	0	0	35	37	0	0	0	7
Hygiene	0	0	18	13	0	1	1	7
Phone	0	0	0	0	58	1	0	7
Reading/Computer/Radio	5	1	0	4	4	246	48	23
Sleeping	7	2	0	0	0	59	183	17
Unknown	9	17	3	9	9	13	18	135
Sensitivity	15.38%	89.58%	62.50%	20.64%	78.38%	75.93%	70.66%	59.47%
Specificity	98.41%	97.18%	96.22%	97.67%	99.30%	90.52%	91.16%	92.15%

Fig. 7.

Accuracy against percentage of windows belonging to the Unknown class for Random Forest and CRF applied to the HIS database.

4.4.7. Subsequent analyses

To assess the impact of including the Unknown class on the learning, the training was performed on the HIS dataset with the best and worst classifiers, CRF and Random Forest, whilst varying the percentage of examples of the Unknown class in the dataset. Figure 7 shows a rapid decrease of Random Forest performance up to 8% of the total dataset being Unknown when it reaches a plateau, while CRF shows a sharp decrease in performance when the Unknown class is being introduced (even by 1% of Unknown examples among all classes), but then the performance decreases very slowly until 10% of examples being Unknown when it reaches a plateau. Thus, although such behaviour calls for further investigation, it seems that the CRF is more robust to the inclusion of the perturbing Unknown class than the Random Forest approach. This is in line with some studies reporting a decrease of performance when a RF is learned from noisy datasets [69].

5. Discussion

Automatic recognition of human activities in smart spaces is an important challenge for Ambient Assisted Living (AAL). In real-world applications, this task would often have to be performed on-line using data from cheap, distant and noisy home automation sensors. In this paper, we present a study to recognize on-line the activities of one dweller from distant (i.e., not worn on the user’s body) home automation sensors (not including any video camera) and microphones using 6 different models: a SVM, a Random Forest, dynamic and non-dynamic Markov Logic Networks, a HMM and a CRF. This study, performed on 2 realistic and publicly available datasets, sheds light on the limitations and advantages of these models for the activity recognition task, which are discussed below.

To achieve the on-line real-time classification requirement of the study, the sequential models (HMM, CRF and MLN) were only learned and applied using the past history, meaning that no future data are known at the time of making the decision. Moreover, on-line and realistic activity recognition in the home must deal with Unknown activities [40] as well as taking into account the transitions between activities (i.e., segmentation of the activities). This is a radical difference from many off-line studies [6,26,54,85] that induce models using both the past and future history, sometimes apply preprocessing to the overall dataset and often use a cross-validation over all the activity windows without considering the detection problem. In our case, both preprocessing and classification are performed without the use of any information on future values.

The results of this study show that the sequential models, as a group (HMM, MLN, CRF), do not significantly outperform the non-sequential models (SVM, Random Forest). This can be explained by two reasons. Firstly, AR is highly dependent on the location; the presence of this information alone in the temporal windows is important enough to allow accurate classification in instance-based models. Secondly, the design of the features for classification in our method allows the inclusion of historical information in the temporal window. For instance, the time the person has spent in the same room accumulates from one window to the next, as long as the person does not change location. However, it must be emphasized that a sequential model is always ranked first in all conditions (CRF three times, HMM once). It can then be concluded that CRF is generally the best suited algorithm for on-line human activity recognition from simple non-visual sensors, as it consistently outperforms its best non-sequential competitor, namely the Random Forest (RF). While a CRF has already been reported as outperforming a HMM in human activity classification tasks [81,82], the competition between CRF and RF has not been previously reported, mostly because these models do not belong to the same type.
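The way such a feature carries history into an otherwise instance-based model can be sketched as follows. This is only an illustration in Python: the function name, the room labels and the 60-second window length are assumptions made for the example and do not reflect the actual feature extraction code.

```python
def time_in_room(locations, window_s=60):
    """Seconds spent in the current room, accumulated over consecutive windows.

    `locations` holds one room label per temporal window, in chronological
    order. Only past and current windows are used, so the feature is
    compatible with on-line processing.
    """
    feature = []
    elapsed = 0
    previous = None
    for room in locations:
        # Keep accumulating while the room is unchanged, otherwise restart.
        elapsed = elapsed + window_s if room == previous else window_s
        feature.append(elapsed)
        previous = room
    return feature


# Hypothetical sequence of per-window locations.
rooms = ["kitchen", "kitchen", "kitchen", "study", "study", "kitchen"]
feature = time_in_room(rooms)
```

A non-sequential classifier that receives this value in each window therefore sees a summary of the recent past without modelling the sequence explicitly.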

The difference in performance between CRF and the HMM/MLN is due to their discriminative or generative natures. While a CRF is trained to maximize the likelihood over the whole dataset, the HMM/MLN are trained by maximizing the likelihood of each class independently. Thus, a CRF biases its learning towards the most dominant classes, as do the non-sequential discriminative schemes (SVM, Random Forest, Naïve MLN). On the contrary, the MLN and HMM model the classes independently, and this explains why they performed better on some activities. For instance, MLN had the best performance for recognizing the phone activity for 3 out of 4 conditions (cf. Tables 6–9), while the HMM showed the best general performance by class for the HIS dataset (cf. Tables 8 and 9).

The inclusion of the Unknown class decreased the performance of all the models. However, it is the generative approaches that are the most visibly affected. Indeed, the MLN and HMM exhibit the most notable decrease in performance, and this is due to their inability to model the Unknown class. Since, in our case, the Unknown class was not an “other” class, because it contained instances very similar to the classes to be found (transitions between activities and activities of no interest to the study), the SVM also showed difficulties in finding hyperplanes separating the Unknown class from the others. In the HIS dataset, since the Unknown class represented more than 18% of all examples, most of the models were highly perturbed. For instance, the Random Forest, being composed of decision trees, tried to generate a set of trees leading to leaves of consistent classes. However, due to the diverse nature of the instances of the Unknown class, generating consistent leaves became very hard. Conversely, the CRF, by considering all the classes, captured complex dependencies in the feature window to give better classification of Unknown instances.

The two datasets used in this study, though being of the same nature and comparable, were not acquired with the same participants or in the same smart home. But the most prominent difference between them is the amount of information each of them provides. The HIS dataset is far less informative than the multimodal Sweet-Home one, due to a lower number of sensors. That explains why the non-sequential models were so competitive in the Sweet-Home case, being able to benefit from informative data, such as door and window activations, as well as temporal data (the previously occupied room). In the HIS case, the number of features being smaller, the non-sequential models exhibited the lowest performance, while the sequential models (HMM, CRF, MLN) stayed competitive. While classical sequential models such as HMM and CRF benefited from the history described within the sequence, the MLN-based approaches took advantage of their high expressibility. For instance the MLN induced the following rules:

Sweet-Home dataset:

1.66047 percentageagitationroom(window,HIGH) -> class(window,PHONE)
1.13709 speech_studywindow(window,HIGH) -> class(window,PHONE)
-1.24098 totalagitation(window,LOW) -> class(window,PHONE)
-0.175539 previous(window1,window2) and class(window1,EATING) -> class(window2,PHONE)

HIS dataset:

1.50854 percentageoc_livingroom(window2,MEDIUM) -> class(window,PHONE)
1.25924 timesinceinthisroom(window2,LOW) -> class(window,PHONE)
-1.00466 timesinceinthisroom(window2,HIGH) -> class(window,PHONE)
0.92 previous(window1,window2) and class(window1,PHONE) -> class(window2,PHONE)

where the head of each rule, class(w, c), takes w, the current window, and predicts c, the class of the window. Each predicate in the body indicates the value of a feature in the window. For instance, speech_studywindow(window,HIGH) indicates that there was a high amount of speech detected in the study room. A positive value to the left of a rule indicates that, when the rule is fired, it adds confidence to the class, while a negative value shows that the rule removes credence from the class. For the particular class “Phone”, the MLN was able to translate the fact that when someone is talking a lot close to the phone for a short time, then s/he is most likely phoning (recall that we assume a single person occupation hypothesis). Thus, despite the relatively low performance of the MLN, this model seems to be a good candidate to represent higher-level activities such as iADL [43], as it is able to express complex semantic relations that purely probabilistic models cannot express.
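The additive reading of the weights can be illustrated with a small Python sketch that sums the weights of the Sweet-Home rules whose body holds in a given window. It is a deliberate simplification and not the inference procedure of an MLN engine: only the four Phone rules listed above are considered, the score is an un-normalised contribution to the log-odds of the Phone class, and the feature values of the example window are hypothetical.

```python
# (weight, feature predicate, discretised value) copied from the
# Sweet-Home rules for the PHONE class listed above.
PHONE_RULES = [
    (1.66047, "percentageagitationroom", "HIGH"),
    (1.13709, "speech_studywindow", "HIGH"),
    (-1.24098, "totalagitation", "LOW"),
]
# Weight of the rule whose body is "the previous window was EATING".
PREVIOUS_EATING_WEIGHT = -0.175539


def phone_log_score(features, previous_class=None):
    """Sum the weights of the rules whose body is true for this window."""
    score = sum(weight for weight, name, value in PHONE_RULES
                if features.get(name) == value)
    if previous_class == "EATING":
        score += PREVIOUS_EATING_WEIGHT
    return score


# Hypothetical window: much agitation and speech in the study.
window = {
    "percentageagitationroom": "HIGH",
    "speech_studywindow": "HIGH",
    "totalagitation": "MEDIUM",
}
score = phone_log_score(window)
```

Rules whose body is false do not contribute in this simplified view, since an implication with a false body is satisfied whatever the class. In a full MLN the class is obtained by probabilistic inference over all weighted formulas and all classes.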

Regarding speech/non-speech audio information, the results of the feature selection performed on both datasets during the tuning phase suggested that the most important features for activity recognition were those related to the location of the inhabitant, followed by those related to speech/non-speech sound occurrences. This can also explain the role of acoustic information in the final accuracy. Even though the most important aspect was the location of the inhabitant, since all the activities were performed in at most two rooms, it was also difficult to disambiguate two different activities that took place in the same room. The total agitation in a room, which was highly dependent on the number of sound events, was helpful to differentiate between eating and cleaning, both performed in the kitchen. In this particular case, the agitation produced by room doors and windows was very similar; however, it was the number of sounds which helped the classifiers to differentiate the activities. Likewise, reading and communication, when both performed in the study, had similar settings on door contacts and light states, but the number of speech events detected was informative enough for good classification. Also, in the MLN model, the weights of the rules relating acoustic information to some activities were large when the association was relevant, as in the following examples:

-1.572 percentagetime_sound(win,LOW) -> class(win,EATING)
1.095 totalagitation(win,LOW) -> class(win,READING)
1.148 totalagitation(win,MEDIUM) -> class(win,PHONE)

In this example, the first rule indicates that an eating activity is unlikely to generate a low amount of sound. The second rule expresses the fact that a reading activity is likely to generate a low amount of sound while the third rule shows that a phoning activity is expected to generate many sound events. These rules are further evidence that audio information is important for activity recognition.

6. Conclusion and future work

The study presented in this paper brings the following contributions:

The paper presents a complete framework for on-line AR, making it possible to summarise asynchronous as well as continuous sampled signals into temporal windows.

This framework has been evaluated on two smart home datasets, available to the community [27,80], integrating acoustic information, a kind of information which has been rarely included in previous studies of the domain. This evaluation shows the interest of these acoustic features for AR since they relate to the agitation level of the occupants (noise) as well as their social interactions (speech).

The AR task in the framework has been implemented with different sequential and instance-based models. This includes a recent model for AR – the Markov Logic Network – in both sequential and non-sequential versions. The evaluation exhibited strengths and weaknesses of each of the models for the AR task.

The models were evaluated on the datasets mentioned above in a realistic way since windows of unknown class are fed to the classifiers. Moreover, to avoid overfitting, a cross-validation technique was designed so as to exclude from the learning set the participant record used for testing.

Overall, Conditional Random Fields (CRF) are very competitive for on-line activity recognition from non-visual, audio and home automation sensors. Even though non-sequential models such as Random Forests show good performance on some datasets, the CRF approach is more robust to the presence of activities of Unknown class, since it showed the least decrease in performance between the datasets with and without the Unknown activities included. The performance of each method was assessed in a Leave-One-Subject-Out-Cross-Validation (LOSOCV), so that no data from the same participant was used both in the training and testing sets in the same trial. Hence, the genericity of the model was not biased by the presence of data relating to the participant being used in that test.
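The splitting principle of a Leave-One-Subject-Out cross-validation can be sketched in a few lines of Python. The function name, participant identifiers and window names below are hypothetical and serve only to show that the held-out participant never appears in the training set.

```python
def leave_one_subject_out(records):
    """Yield (held_out_id, train, test) for each participant in turn.

    `records` is a list of (participant_id, sample) pairs. In each fold,
    every sample of one participant forms the test set and the samples of
    all other participants form the training set.
    """
    subjects = sorted({pid for pid, _ in records})
    for held_out in subjects:
        train = [sample for pid, sample in records if pid != held_out]
        test = [sample for pid, sample in records if pid == held_out]
        yield held_out, train, test


# Hypothetical windows recorded with three participants.
records = [("P01", "w1"), ("P01", "w2"), ("P02", "w3"), ("P03", "w4")]
folds = list(leave_one_subject_out(records))
```

Grouping by participant rather than by window keeps windows of the same person, which are strongly correlated, from leaking between the training and testing sets.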

Although the CRF has the best performance overall, generative models such as the HMM and the MLN also perform well. These models show interesting features as they are able to model each class independently, and thus do not bias their learning towards the largest class. Moreover, the Markov Logic Network approach (MLN) is a statistical-relational model, and so its logical structure could be learned in conjunction with a priori knowledge provided by expert rules, so that the model can benefit from highly expressive previous knowledge whilst also being able to handle uncertainty.

The results presented in this study are based on two different datasets that were not acquired in the same environment and that rely on different sensors. Although the datasets differ in terms of activities considered, sensors and quantity of data, it has been shown that the trends in the results from the two sets are similar.

Finally, it has to be noted that, for some of the results, the standard deviation shows a huge variability between subjects. Some of the activities were represented with few samples and the difference between subjects is then more pronounced. For future work, a generic model that adapts to a participant using the first samples would be a very good direction of research.

We plan to extend our work in two directions. On the one hand, we would like to compare a window-based approach, which loses semantics and temporality but summarises the data well, against an event/state-based approach, which keeps semantic and time information but necessitates the use of even more robust models to handle errors in the data stream. It would be interesting to study the behaviour of the CRF and MLN approaches in these two cases. On the other hand, one of the main problems in human activity learning is the lack of annotated data. Indeed, in-lab recording of scenarios allows an accurate annotation with many participants, but this is not the case for real-world data. Field experiments in real homes do provide more realistic data but annotation is often performed by the participants themselves and cannot easily be verified [7]. Moreover, it is difficult to recruit participants who would be willing to have surveillance technology set up in their own home for experimental purposes. Besides, collecting real-world data is very expensive in terms of time and resources. This is why we intend to use learning methods that either deal with partially labelled data [75] or use a Universal Background Model [64] so that a large amount of data, of which only a small portion is annotated, can be used for classification.

Acknowledgements

This work is a part of the Sweet-Home project supported by the Agence Nationale de la Recherche (ANR-09-VERS-011). Many thanks to the anonymous reviewers for their very constructive comments that helped improve the manuscript.

References

Aggarwal and

Ryoo, Human activity analysis: A review, ACM Computing Surveys43(3) (2011), 16:1–16:43. doi:10.1145/1922649.1922653.

Artikis,

Paliouras,

Portet and

Skarlatidis, Logic-based representation, reasoning and machine learning for event recognition, in: The 4th ACM International Conference on Distributed Event-Based Systems (DEBS), 2010, pp. 282–293.

Artikis,

Skarlatidis and

Paliouras, Behaviour recognition from video content: A logic programming approach, International Journal on Artificial Intelligence Tools19(2) (2010), 193–209. doi:10.1142/S021821301000011X.

Artikis,

Skarlatidis,

Portet and

Paliouras, Logic-based event recognition, Knowledge Engineering Review27(4) (2012), 469–506. doi:10.1017/S0269888912000264.

J.C.

Augusto and

C.D.

Nugent, The use of temporal reasoning and management of complex events in smart homes, in: European Conference on Artificial Intelligence, 2004, pp. 778–782.

Bao and

Intille, Activity recognition from user-annotated acceleration data, in: Pervasive Computing, Lecture Notes in Computer Science, Vol. 3001, Springer, 2004, pp. 1–17. doi:10.1007/978-3-540-24646-6_1.

Blachon,

Portet,

Besacier and

Tassart, RecordMe: A smartphone application for experimental collections of large amount of data respecting volunteer’s privacy, in: UCAmI 2014, Belfast, United Kingdom, 2014, pp. 345–348.

B.E.

Boser,

I.M.

Guyon and

V.N.

Vapnik, A training algorithm for optimal margin classifiers, in: Fifth Annual Workshop on Computational Learning Theory, ACM, Pittsburgh, 1992, pp. 144–152. doi:10.1145/130385.130401.

Bouakaz,

Vacher,

M.-E.

Bobillier-Chaumon,

Aman,

Bekkadja,

Portet,

Guillou,

Rossato,

Desserée,

Traineau,

J.-P.

Vimon and

Chevalier, CIRDO: Smart companion for helping elderly to live at home for longer, Innovation and Research in BioMedical Engineering (IRBM)35(2) (2014), 101–108.

10.

Breiman, Random forests, Machine Learning45(1) (2001), 5–32. doi:10.1023/A:1010933404324.

11.

Chahuara,

Fleury,

Vacher and

Portet, Méthodes SVM et MLN pour la reconnaissance automatique d’activités humaines dans les habitats perceptifs: Tests et perspectives, in: Actes de la Conférence RFIA 2012, Lyon, France, 2012, pp. 340–347.

12.

Chahuara,

Portet and

Vacher, Location of an inhabitant for domotic assistance through fusion of audio and non-visual data, in: Pervasive Health, Dublin, Ireland, European Alliance for Innovation, 2011, pp. 1–4, http://www.pervasivehealth.org/.

13.

Chahuara,

Portet and

Vacher, Making context aware decision from uncertain information in a Smart Home: A Markov Logic Network approach, in: Ambient Intelligence, Dublin, Ireland, Lecture Notes in Computer Science, Vol. 8309, Springer, 2013, pp. 78–93. doi:10.1007/978-3-319-03647-2_6.

14.

Chen,

Hoey,

Nugent,

Cook and

Yu, Sensor-based activity recognition, IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews42(6) (2012), 790–808. doi:10.1109/TSMCC.2012.2198883.

15.

Chen,

Nugent,

Mulvenna,

Finlay,

Hong and

Poland, A logical framework for behaviour reasoning and assistance in a smart home, International Journal of Assistive Robotics and Mechatronics9(4) (2008).

16.

Chen,

Nugent and

Wang, A knowledge-driven approach to activity recognition in smart homes, IEEE Transactions on Knowledge and Data Engineering24(6) (2012), 961–974. doi:10.1109/TKDE.2011.51.

17.

Chieu,

Lee and

Kaelbling, Activity recognition from physiological data using conditional random fields, in: Proc. of the Singapore-MIT Alliance Symposium (SMA), 2006.

18.

Clarkson, Life patterns: Structure from wearable sensors, PhD thesis, Massachusetts Institute of Technology, USA, 2003.

19.

D.J.

Cook,

Youngblood,

E.O.

Heierman III,

Gopalratnam,

Rao,

Litvin and

Khawaja, Mavhome: An agent-based smart home, in: IEEE International Conference on Pervasive Computing and Communications, 2003, p. 521.

20.

Coutaz,

J.L.

Crowley,

Dobson and

Garlan, Context is key, Communications of the ACM48(3) (2005), 49–53. doi:10.1145/1047671.1047703.

21. Dalal, Alwan, Seifrafi, Kell and Brown, A rule-based approach to the analysis of elders activity data: Detection of health and possible emergency conditions, in: AAAI Fall 2005 Symposium, 2005.

22. de Carolis and Cozzolongo, C@sa: Intelligent home control and simulation, in: International Conference on Computational Intelligence (ICCI), Istanbul, Turkey, 2004, pp. 462–465.

23. Duong, Phung, Bui and Venkatesh, Efficient duration and hierarchical modeling for human activity recognition, Artificial Intelligence 173(7–8) (2009), 830–856. doi:10.1016/j.artint.2008.12.005.

24. J.A. Flanagan, Mantyjarvi and Himberg, Unsupervised clustering of symbol strings and context recognition, in: Proc. of the 2002 IEEE International Conference on Data Mining, ICDM ’02, IEEE Computer Society, Washington, DC, USA, 2002, pp. 171–178.

25. Fleury, Noury and Vacher, Improving supervised classification of activities of daily living using prior knowledge, International Journal of E-Health and Medical Communications (IJEHMC) 2(1) (2011), 17–34. doi:10.4018/jehmc.2011010102.

26. Fleury, Vacher and Noury, SVM-based multi-modal classification of activities of daily living in health smart homes: Sensors, algorithms and first experimental results, IEEE Transactions on Information Technology in Biomedicine 14(2) (2010), 274–283. doi:10.1109/TITB.2009.2037317.

27. Fleury, Vacher, Portet, Chahuara and Noury, A French corpus of audio and multimodal interactions in a health smart home, Journal on Multimodal User Interfaces 7(1) (2012), 93–109.

28. Gallissot, Caelen, Jambon and Meillon, Une plate-forme usage pour l’intégration de l’informatique ambiante dans l’habitat, Domus. Technique et Science Informatiques (TSI) 32(5) (2013), 547–574.

29. Getoor and Taskar (eds), Introduction to Statistical Relational Learning, The MIT Press, 2007.

30. K.Z. Haigh and Yanco, Automation as caregiver: A survey of issues and technologies, in: Proc. of the AAAI-02 Workshop Automation as Caregiver: The Role of Intelligent Technology in Elder Care, 2002, pp. 39–53.

31. Hamid, A computational framework for unsupervised analysis of everyday human activities, PhD thesis, Georgia Institute of Technology, USA, 2008.

32. Helaoui, Niepert and Stuckenschmidt, A statistical-relational activity recognition framework for ambient assisted living systems, in: Ambient Intelligence and Future Trends-International Symposium on Ambient Intelligence (ISAmI 2010), Guimarães, Portugal, 2010, pp. 247–254. doi:10.1007/978-3-642-13268-1_34.

33. Helaoui, Niepert and Stuckenschmidt, Recognizing interleaved and concurrent activities: A statistical-relational approach, in: International Conference on Pervasive Computing and Communications (PerCom), 2011, pp. 1–9.

34. Z.Z. Htike, Egerton and Y.C. Kuang, Real-time human activity recognition using external and internal spatial features, in: Sixth International Conference on Intelligent Environments, Kuala Lumpur, Malaysia, IEEE Computer Society, 2010, pp. 52–57.

35. S.S. Intille, Designing a home of the future, IEEE Pervasive Computing 1(2) (2002), 76–82. doi:10.1109/MPRV.2002.1012340.

36. Istrate, Castelli, Vacher, Besacier and J.-F. Serignat, Information extraction from sound for medical telemonitoring, IEEE Transactions on Information Technology in Biomedicine 10(2) (2006), 264–274. doi:10.1109/TITB.2005.859889.

37. Katz, Assessing self-maintenance: Activities of daily living, mobility, and instrumental activities of daily living, Journal of the American Geriatrics Society 31(12) (1983), 721–727. doi:10.1111/j.1532-5415.1983.tb03391.x.

38. Kersting, L.D. Raedt and Raiko, Logical hidden Markov models, Journal of Artificial Intelligence Research 25 (2006), 425–456.

39. Khan, Y.-K. Lee, Lee and T.-S. Kim, A tri-axial accelerometer-based physical-activity recognition via augmented-signal features and a hierarchical recognizer, IEEE Transactions on Information Technology in Biomedicine 14(5) (2010), 1166–1172. doi:10.1109/TITB.2010.2051955.

40. N.C. Krishnan and D.J. Cook, Activity recognition on streaming sensor data, Pervasive and Mobile Computing 10 (2014), 138–154. doi:10.1016/j.pmcj.2012.07.003.

41. L.A. Kurgan and K.J. Cios, CAIM discretization algorithm, IEEE Transactions on Knowledge and Data Engineering 16(2) (2004), 145–153. doi:10.1109/TKDE.2004.1269594.

42. J.D. Lafferty, McCallum and F.C.N. Pereira, Conditional random fields: Probabilistic models for segmenting and labeling sequence data, in: International Conference on Machine Learning, ICML’01, 2001, pp. 282–289.

43. Lawton and Brody, Assessment of older people: Self-maintaining and instrumental activities of daily living, Gerontologist 9 (1969), 179–186.

44. Le Bellego, Noury, Virone, Mousseau and Demongeot, A model for the measurement of patient activity in a hospital suite, IEEE Transactions on Information Technology in Biomedicine 10(1) (2006), 92–99. doi:10.1109/TITB.2005.856855.

45. M.-W. Lee, A.M. Khan and T.-S. Kim, A single tri-axial accelerometer-based real-time personal life log system capable of human activity recognition and exercise information generation, Personal and Ubiquitous Computing 15(8) (2011), 887–898. doi:10.1007/s00779-011-0403-3.

46. Liao, Location-based activity recognition, PhD thesis, University of Washington, USA, 2006.

47. Lin, M.-T. Sun, Poovendran and Zhang, Activity recognition using a combination of category components and local models for video surveillance, IEEE Transactions on Circuits and Systems for Video Technology 18(8) (2008), 1128–1139. doi:10.1109/TCSVT.2008.927111.

48. Lowd and Domingos, Efficient weight learning for Markov logic networks, in: Proc. of the Eleventh European Conference on Principles and Practice of Knowledge Discovery in Databases, 2007, pp. 200–211.

49. Minnen, Unsupervised discovery of activity primitives from multivariate sensor data, PhD thesis, Georgia Institute of Technology, USA, 2008.

50. L.C. Molina, Belanche and À. Nebot, Feature selection algorithms: A survey and experimental evaluation, in: Proc. of the IEEE International Conference on Data Mining (ICDM 2002), 2002, pp. 306–313.

51. M.C. Mozer, Lessons from an adaptive home, in: Smart Environments: Technologies, Protocols, and Applications, Cook and Das, eds, John Wiley & Sons, Inc., 2005, pp. 271–294. doi:10.1002/047168659X.ch12.

52. Naeem and Bigham, Activity recognition using a hierarchical framework, in: Second Int. Conf. on Pervasive Computing Technologies for Healthcare, 2008, pp. 24–27.

53. Natarajan, H.H. Bui, Tadepalli, Kersting and Wong, Logical hierarchical hidden Markov models for modeling user activities, in: Proc. of the 18th International Conference on Inductive Logic Programming, Prague, Czech Republic, Springer-Verlag, 2008, pp. 192–209. doi:10.1007/978-3-540-85928-4_17.

54. Nazerfard, Das, L.B. Holder and D.J. Cook, Conditional random fields for activity recognition in smart environments, in: Proc. of the 1st ACM International Health Informatics Symposium, IHI ’10, ACM, New York, NY, USA, 2010, pp. 282–286.

55. Q.N. Ni, García Hernando and I.P. de la Cruz, The elderly’s independent living in smart homes: A characterization of activities and sensing infrastructure survey to facilitate services development, Sensors 15 (2015), 11312–11362. doi:10.3390/s150511312.

56. Okeyo, Chen, Wang and Sterritt, Dynamic sensor data segmentation for real-time knowledge-driven activity recognition, Pervasive and Mobile Computing (2012), (in press).

57. D.J. Patterson, Etzioni, Fox and Kautz, Intelligent ubiquitous computing to support Alzheimer’s patients: Enabling the cognitively disabled, in: Fourth International Conference on Ubiquitous Computing, Göteborg, Sweden, 2002, pp. 21–22.

58. Pentney, A.-M. Popescu, Wang, Kautz and Philipose, Sensor-based understanding of daily life via large-scale use of common sense, in: Proc. of the 21st National Conference on Artificial Intelligence – Volume 1, AAAI’06, AAAI Press, 2006, pp. 906–912.

59. Portet, Vacher, Golanski, Roux and Meillon, Design and evaluation of a smart home voice interface for the elderly: Acceptability and objection aspects, Personal and Ubiquitous Computing 17 (2013), 127–144. doi:10.1007/s00779-011-0470-5.

60. J.R. Quinlan, Bagging, boosting, and C4.5, in: Proc. of the 13th National Conference on Artificial Intelligence (AAAI-96), AAAI/MIT Press, 1996, pp. 725–730.

61. Rabiner, A tutorial on Hidden Markov Models and selected applications in speech recognition, Proceedings of the IEEE 77(2) (1989), 257–286. doi:10.1109/5.18626.

62. Rasanen, Hierarchical unsupervised discovery of user context from multivariate sensory data, in: 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012, pp. 2105–2108. doi:10.1109/ICASSP.2012.6288326.

63. Rashidi and Mihailidis, A survey on ambient-assisted living tools for older adults, IEEE Journal of Biomedical and Health Informatics 17(3) (2013), 579–590. doi:10.1109/JBHI.2012.2234129.

64. D.A. Reynolds, T.F. Quatieri and R.B. Dunn, Speaker verification using adapted Gaussian mixture models, Digital Signal Processing 10(1–3) (2000), 19–41. doi:10.1006/dspr.1999.0361.

65. Rialle, Rapport sur les technologies nouvelles susceptibles d’améliorer les pratiques gérontologiques et la vie quotidienne des malades âgés et de leur famille, Technical report, Rapport remis à M. Philippe Bas, Ministre de la Santé et des Solidarités, République Française, p. 74.

66. Richardson and Domingos, Markov logic networks, Machine Learning 62(1–2) (2006), 107–136. doi:10.1007/s10994-006-5833-1.

67. Rota and Thonnat, Activity recognition from video sequences using declarative models, in: European Conference on Ambient Intelligence (ECAI), Horn, ed., IOS Press, 2000, pp. 673–680.

68. Saeys, Inza and Larrañaga, A review of feature selection techniques in bioinformatics, Bioinformatics 23 (2007), 2507–2517. doi:10.1093/bioinformatics/btm344.

69. Segal, Machine learning benchmarks and random forest regression, Technical report, Center for Bioinformatics and Molecular Biostatistics, University of California, San Francisco, CA, USA, 2004.

70. Storf, Becker and Riedl, Rule-based activity recognition framework: Challenges, technique and learning, in: 3rd International Conference on Pervasive Computing Technologies for Healthcare, PervasiveHealth 2009, London, UK, 2009, pp. 1–7.

71. Tang and Venables, Smart homes and telecare for independent living, Journal of Telemedicine and Telecare 6(1) (2000), 8–14. doi:10.1258/1357633001933871.

72. E.M. Tapia, S.S. Intille and Larson, Activity recognition in the home using simple and ubiquitous sensors, Pervasive Computing 2 (2004), 158–175. doi:10.1007/978-3-540-24646-6_10.

73. Taskar, Abbeel and Koller, Discriminative probabilistic models for relational data, in: Proc. of the Eighteenth Conference on Uncertainty in Artificial Intelligence, UAI’02, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2002, pp. 485–492.

74. Tong and Chen, Latent-dynamic conditional random fields for recognizing activities in Smart Homes, Journal of Ambient Intelligence and Smart Environments 6 (2014), 39–55.

75. Truyen, Bui, Phung and Venkatesh, Learning discriminative sequence models from partially labelled data for activity recognition, in: PRICAI 2008: Trends in Artificial Intelligence, T.-B. Ho and Z.-H. Zhou, eds, Lecture Notes in Computer Science, Vol. 5351, Springer, Berlin, Heidelberg, 2008, pp. 903–912. doi:10.1007/978-3-540-89197-0_84.

76. Vacher, Caffiau, Portet, Meillon, Roux, Elias, Lecouteux and Chahuara, Evaluation of a context-aware voice interface for Ambient Assisted Living: Qualitative user study vs. quantitative system evaluation, ACM Transactions on Accessible Computing 7(2) (2015), 1–36.

77. Vacher, Chahuara, Lecouteux, Istrate, Portet, Joubert, Sehili, Meillon, Bonnefond, Fabre, Roux and Caffiau, The SWEET-HOME project: Audio technology in Smart Homes to improve well-being and reliance, in: 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC’13), Osaka, Japan, 2013, pp. 7298–7301.

78. Vacher, Fleury, Portet, J.-F. Serignat and Noury, Complete sound and speech recognition system for health Smart Homes: Application to the recognition of activities of daily living, in: New Developments in Biomedical Engineering, Campolo, ed., In-Tech, 2010, pp. 645–673. ISBN 978-953-7619-57-2.

79. Vacher, Fleury, J.-F. Serignat, Noury and Glasson, Preliminary evaluation of speech/sound recognition for telemedicine application in a real environment, in: Interspeech’08, Brisbane, Australia, 2008, pp. 496–499.

80. Vacher, Lecouteux, Chahuara, Portet, Meillon and Bonnefond, The sweet-home speech and multimodal corpus for home automation interaction, in: The 9th Edition of the Language Resources and Evaluation Conference (LREC), Reykjavik, Iceland, 2014, pp. 4499–4506.

81. D.L. Vail, M.M. Veloso and J.D. Lafferty, Conditional random fields for activity recognition, in: Proc. of the 6th International Joint Conference on Autonomous Agents and Multiagent Systems, AAMAS ’07, New York, NY, USA, 2007, pp. 235:1–235:8.

82. van Kasteren, Englebienne and Kröse, An activity monitoring system for elderly care using generative and discriminative models, Personal and Ubiquitous Computing 14(6) (2010), 489–498. doi:10.1007/s00779-009-0277-9.

83. T.L.M. van Kasteren, Englebienne and B.J.A. Kröse, Hierarchical activity recognition using automatically clustered actions, in: Proc. of the Second International Conference on Ambient Intelligence, AmI’11, Springer-Verlag, Berlin, Heidelberg, 2011, pp. 82–91.

84. van Kasteren and Kröse, Bayesian activity recognition in residence for elders, in: 3rd IET International Conference on Intelligent Environments, Ulm, Germany, 2007, pp. 209–212.

85. Velik, A brain-inspired multimodal data mining approach for human activity recognition in elderly homes, Journal of Ambient Intelligence and Smart Environments 6(4) (2014), 447–468.

86. Wang and Ji, Learning dynamic Bayesian network discriminatively for human activity recognition, in: 21st International Conference on Pattern Recognition (ICPR 2012), 2012, pp. 3553–3556.

87. Zappi, Lombriser, Stiefmeier, Farella, Roggen, Benini and Tröster, Activity recognition from on-body sensors: Accuracy-power trade-off by dynamic sensor selection, in: European Conference on Wireless Sensor Networks, Verdone, ed., Lecture Notes in Computer Science, Vol. 4913, Springer, 2008, pp. 17–33. doi:10.1007/978-3-540-77690-1_2.

88. Zouba, Bremond, Thonnat, Anfosso, Pascual, Mallea, Mailland and Guerin, A computer system to monitor older adults at home: Preliminary results, Gerontechnology Journal 8(3) (2009), 129–139.

Footnotes

²
As with any technology, video cameras can be very well accepted if the perceived benefit outweighs the feeling of intrusion.

³
In this paper, we distinguish the activity (i.e., the task being performed) from the activeness (i.e., the state of being active). Activeness is also called ‘total agitation’ in the paper, from the French ‘agitation’, which is equivalent to ‘bustle’ in English.

⁴
liris.cnrs.fr/advene.

⁵
Here ‘record’ means the full record for a single participant.

⁶
These tables also give sensitivity and specificity. As a reminder, let TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives, respectively. Then $Sensitivity = \frac{TP}{TP + FN}$ and $Specificity = \frac{TN}{TN + FP}$.
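
As an illustration of the two measures recalled in footnote 6, the following minimal Python sketch computes sensitivity as TP / (TP + FN) and specificity as TN / (TN + FP) from confusion-matrix counts. The function name and the example counts are hypothetical and are not taken from the paper's experiments.

```python
def sensitivity_specificity(tp, tn, fp, fn):
    """Return (sensitivity, specificity) from confusion-matrix counts.

    tp, tn, fp, fn are the numbers of true positives, true negatives,
    false positives and false negatives for one activity class.
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative example")
    sensitivity = tp / (tp + fn)  # share of actual positives that are detected
    specificity = tn / (tn + fp)  # share of actual negatives that are rejected
    return sensitivity, specificity


# Hypothetical counts for a single activity class (illustration only).
sens, spec = sensitivity_specificity(tp=40, tn=45, fp=5, fn=10)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

For a multi-class problem such as activity recognition, these counts would typically be obtained per class in a one-versus-rest fashion; that choice is an assumption of this sketch rather than a statement about the paper's evaluation protocol.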