Abstract
Over the past decade intelligent environments have grown in sophistication. Many recent paradigm shifts – such as the Internet of Things (IoT), Ambient Assisted Living (AAL), e-health and telemedicine – envision large distributed networks of intelligent devices, applications and services that are sensitive to the presence of people and responsive to their needs. Cutting edge technologies will autonomously and collectively operate on a growing volume of information arriving at ever increasing velocities to transparently and non-intrusively support users during their activities. Especially the escalating variety of information that applications have to deal with is a non-trivial concern. Making sense out of heterogeneous and pervasive streams of sensor events to anticipate and address the needs of users is a ubiquitous challenge that many interactive context-aware applications in intelligent environments frequently face. Furthermore, software solutions that continuously interpret the tasks and contexts of a variety of individuals with different needs are often faced with scalability concerns.
We present SAMURAI, a batch and streaming context architecture that integrates and exposes well-known components for complex event processing, machine learning, and knowledge representation. SAMURAI builds upon key concepts of the Lambda architecture and big data enabling technologies to achieve horizontal scalability and responsive interaction with its users. Two application cases validate the feasibility and performance of our context architecture, demonstrating near-linear scalability, flexible elasticity and smooth interaction capabilities.
Introduction
The exponential data growth is an opportunity to build sophisticated intelligent environments. With the advent of trends like big data, smart applications increasingly obtain more useful information about their users and their preferences. With data being volunteered at an unprecedented scale, context-aware computing is becoming a game changer as it allows service providers of a variety of intelligent environment applications to customize their solutions to their users and this with decisions no longer based on speculative presumptions or manually crafted rules and adaptation logic, but rather on data-driven models. Indeed, mobile and wearable computing platforms, like smartphones and smartwatches [38], embed sensor technology to observe acceleration, location, orientation, ambient lighting, sound, imagery [24]. Especially, in the field of m-health and e-health, the use of such wearable devices is becoming more prevalent [31,39] with mobile health applications to monitor a variety of health parameters including posture [25], heart conditions [45], diabetes [3,33], and physical activity [6]. Furthermore, emerging computing paradigms like the Internet of Things (IoT) [4] will further spark sensor technology to become omnipresent in our surroundings, and will promise a continuous data growth.
Indeed, intelligent environments are evolving to open ended large scale and dynamic network infrastructures fueled by low cost, connected and wirelessly communicating devices that collect data, relay information to one another, process the information collaboratively, and take actions in an autonomic way. However, relevant information about the user is also becoming more complex, heterogeneous, and scattered. With the increased prevalence of mobile applications, the expectations of a frictionless customer experience, and the diversity in a user’s computing environments, tapping into this exponential data growth with conventional methods has become an arduous undertaking. Given the unpredictable peaks of high computational cost to collect and process data, being able to make sense of large volumes of data with uncertain veracity and value from a variety of context sources in near real-time will become a key differentiator [19] for future intelligent environment solutions.
Sophisticated intelligent applications and services should factor in all relevant context information about the user and his situation, but are often faced with the following non-trivial questions on how to effectively unlock the large yet untapped sources of context information:
Which information will influence application-level decisions for user adaptive behavior? Is this information readily available or does it require additional processing before it is useful? Can we process this information offline in batch mode or online in a streaming fashion? What are acceptable processing latencies to guarantee a smooth interaction and user experience? Can we easily reuse the context processing components for other applications? Can our context system be distributed and scale out on demand when the workload grows? Individuals with different needs sharing the same applications, services and infrastructure Heterogeneous context data sources and event types with varying degrees of veracity Loosely structured and distributed event streams collectively adding value to the application
The above questions occur frequently across context-aware adaptive applications, and have amplified the need for context-aware computing solutions offered as services that can deal with:
Enabling technologies that provide a reusable, scalable and reliable software solution for the above concerns have all what it takes to become an indispensable foundation for a wide variety of intelligent applications and environments. We are tackling these challenges by extending and enhancing SAMURAI [34], our previous award-winning research on streaming multi-tenant context-management architecture for intelligent and scalable Internet of Things applications. The first version of this system was developed within the frame of the FP7 BUTLER project1
After reviewing related work in Section 2, we present two use cases in Section 3 as motivating examples. Section 4 discusses the design of our batch and streaming context architecture. In Section 5 we evaluate the feasibility and effectiveness of the approach. We conclude in Section 6 summarizing the main insights and identifying possible topics for future work.
Before we dive into the contributions of our work, we briefly discuss relevant state-of-the-art on activity recognition, big data enabling technologies and scalable learning and prediction algorithms.
Behavior-awareness and activity recognition
Especially in the area of home-care applications and Ambient Assisted Living [12] for the elderly, automatic discovery and classification of daily activities plays a key role [40,51] to anticipate the kind of assistance they need or to detect the occurrence of abnormal events (such as a fall or heart failure). Accelerometers [23,37] are a popular type of sensor for activity recognition and to assess physical activity. Other works in this field rely on numerous sensors to enrich observation data, and often depend on prior knowledge about the activities and the environment learned in a supervised manner.
In [46], Lin et al. present an activity recognition approach using a mobile phone. The data from 6 different test subjects is collected on a Nokia N97, and used to build an SVM-classifier. Five types of features are explored in this work, including mean, variance, correlation, FFT-energy and frequency-domain entropy. One of the main factors that can influence the recognition rate is the position where a user is carrying his device. This position can either be in the pocket near the hip, in the front pocket or just in his hand. The influence of this position on the accuracy of predictions is researched in this work.
In [20], Khan et al. focus on the fact that the activity recognition approach should work in real-time. The authors claim that frequency domain features work best, but that these require too much computation to be feasible in a real-time scenario. Benchmark testing is carried out with one specific accelerometer. While not further elaborated on, the authors note that the specific set of features used makes the approach more person dependent. In their multiple-subject scenario, data from multiple test subjects is used to train a classifier. They use this input from multiple subjects irrespective of the physique of the persons, but do note that adding more subjects decreases the performance.
Machine learning techniques are frequently used to model a wide range of human activities and to elicit particular patterns of interest. Some techniques investigate to what extend such models can be learned across different users [8] to address the concern of activity recognition algorithms requiring substantial amounts of labeled training data. However, a detailed discussion on human activity recognition is beyond the scope of this work. We refer interested readers to recent surveys [1,7] for a more elaborate and in-depth overview of human activity recognition research.
Batch and stream processing on big data
The data explosion of the Internet of Things is often linked with the Big Data paradigm. MapReduce [10] – and its Hadoop [48] implementation – is a software framework and programming model that allows developers to write programs that process massive amounts of unstructured data in parallel across a distributed cluster of computers. The shortcomings and drawbacks of batch-oriented data processing have been widely recognized as many applications are in need of real-time [5,44] and in-stream processing capabilities [9]. This concept got a lot of traction with various distributed event stream processing (ESP) engines emerging. Yahoo’s S4 [29] and Twitter’s Storm project [26,47] were among the first to attract a lot of attention. Also Google acknowledged the limitations of MapReduce, with its MillWheel [2] framework and programming model dedicated to fault-tolerant stream processing at Internet scale. Spark [49] is another state-of-practice software solution for large-scale data processing. It runs up to 100 times faster in memory than Hadoop MapReduce and supports scalable fault-tolerant streaming applications. Spark Streaming [50] builds upon the Spark’s foundations to build big data application that act on data in real time. Apache Samza2
The growing amounts of data has increased the interest in implementing machine learning algorithms on top of the MapReduce framework [14,15]. Mahout [30], a machine learning and data mining framework built on top of Hadoop, addresses this scalability challenge. MLbase [22] builds upon the Spark framework and more particularly the MLI [43] API and Spark’s MLlib3
Stream mining [13] differs from more traditional data mining libraries like Weka [17] and statistical learning tools like R [18] in the sense that they focus on extracting knowledge in non-stopping streams of events. Massive Online Analysis (MOA) [21] is such a software framework that offers a variety of algorithms and evaluation methods for supervised and unsupervised learning, supporting only single machine deployments thereby limiting its scalability. SAMOA [28] on the other hand is pluggable architecture that allows it to run on several distributed stream processing engines such as Storm [47], S4 [29], and Samza.

A mobile context-aware diabetes application as a first motivating use case.

User-friendly context-aware authentication as a second motivating use case.
In this section, we present two different use cases. A first use case deals with activity recognition in the healthcare domain. A second use case focuses on the use of context information to implement an intelligent authentication system.
Activity recognition for e-health applications
A first use case builds upon our mobile application for diabetes patients [32,35] as depicted in Fig. 1. The major challenge in this application is to identify classes of activities that have an effect on blood glucose levels. As activities of daily living (ADL) typically present recurring behavioral patterns, we explore correlations between time and location on the one hand and types of activities on the other hand, to find similar situations of the past as a recommendation for the patient. We also track the number of steps taken each day as a measure for well-being and as a means for recommendation to have a more active lifestyle.

Early version of the SAMURAI streaming context architecture.
Traditional systems for identity and access management technologies rely heavily on usernames and passwords for authentication. However, weak passwords do not offer the security guarantees for risk-sensitive services that require stronger continuous identity assurance. Furthermore, entering long and hard-to-remember passwords is deemed inconvenient for mobile customers, especially for online applications where a frictionless experience is paramount. This motivating use case originates from our earlier work [36] that investigates non-intrusive authentication techniques that can operate silently in the background based on additional context and behavioral information [41]. Continuous passive assessment of the context, as depicted in Fig. 2, enables service providers to streamline access for trusted combinations of user accounts and contexts. Related work [11] has shown that particular imperfections of an accelerometer can be used to track a smartphone. We are not pursuing this technique to explore privacy concerns, but rather as a means to recognize trusted devices.

Conceptual and technology-agnostic overview of a typical Lambda architecture.
Both motivating use cases rely at least in part on accelerometer data and behavior recognition. By adding complexity, such as machine learning and semantic reasoning techniques, we can further improve the recognition accuracy:
However, such complexity is usually too much to handle for a smartphone. Even for server-side implementations scalability remains a concern when multiple customers with different needs must be served at the same time. In the following section, we will discuss the basic primitives that we use in our framework to address these challenges.
Batch and streaming context management
The earlier version of SAMURAI – as discussed in [34] and depicted in Fig. 3 – was mainly focused on stream-based processing of context information. The old design offered publish/subscribe capabilities to have clients (applications or subsystems) notified when particular (patterns of) events occur with push notifications implemented as REST callbacks (see following section). The architecture had three basic components to hold events:
SAMURAI pursued simple methods for scaling out over multiple nodes by replicating functionality and building blocks on different machines. However, the event processing building blocks were not adequate enough to handle large amounts of data effectively. Here are some high-level reasons for why we decided to redesign our SAMURAI system:
For more effective context recognition, we wanted to train and test with specialized machine learning algorithms that would operate on all the available data. The previous design and deployment configuration could no longer handle all data effectively with Weka running on a single node, nor was it feasible to carry this out in a streaming fashion. We needed better support for batch-based distributed context processing.
For applications that required large amounts of data to be processed, the simple sharding technique we used to scale out resulted into performance problems, mainly due to network latency and bandwidth issues. We needed to take data locality into consideration and move the computation where the data is in order to avoid as many data shuffles over the network as possible.
The fairly rigid distributed deployment scheme of and coordination between multiple SAMURAI instances (as depicted in Fig. 3) was fine-tuned for a particular runtime scenario, but the deployment and configuration became less effective when the amount or the speed of information evolved. Furthermore, our solution was not able to deal with stragglers, i.e. nodes that would take an unusually long time to complete a task, potentially degrading the overall performance.
Fault tolerance becomes a more critical concern for large-scale deployments. Big data systems are designed with failure as the norm, rather than as the exception, because hardware failures or network partitions could otherwise lead to unexpected data losses. Our old solution did not have any adequate means to gracefully handle such concerns. It did incorporate data replication to account for nodes being disconnected, but events in transit could get lost and as a result not be processed correctly.
By redesigning SAMURAI, we now aim to fill this gap by offering a distributed and multi-tenant event-based batch and streaming context architecture with complex event processing, machine learning and semantic context enrichment as key capabilities.
Many of the large-scale distributed data processing systems discussed earlier (e.g. Hadoop, Spark, Storm) have shown their merit in the enterprise for business analytics applications to deal with the above non-functional concerns. In this work we explore the feasibility of such systems for large-scale context-aware applications.
Basic principles of the Lambda architecture
We redesigned our solution around key concepts of the Lambda Architecture – a term coined by Nathan Marz [27] – to achieve processing and serving of extremely high volumes of data in an efficient, scalable and fault-tolerant manner, while maintaining a responsive interaction with the user. Conceptually, the Lambda Architecture – as depicted in Fig. 4 consists of three layers: (1) the Batch layer which ingests and stores large amounts of historical data, and computes views from that data; (2) the Speed layer which ingests and processes incremental updates on that data in a low-latency streaming manner, and (3) the Serving layer that exposes precomputed views to serve ad-hoc queries with low latency.
Any new data is stored in the batch layer, but also sent to the speed layer. The data in the batch layer is ingested using periodic bulk updates with map-reduce operations that work on the entire master data set (e.g. an immutable store in HDFS). Computations in the batch layer are high-latency and may take hours to complete, and the results of these recomputations are the batch views. After each recomputation, the existing batch views are swapped with the new ones. The speed layer only deals with new data and produces real-time views that compensate for the high-latency updates of the serving layer. The incremental updates in the speed layer are low-latency and usually happen in the order of seconds. The final results for a query merges the output of both the batch and real-time views.

Example of an event type and instance.

Distributed semantic reasoning with RDF triples on top of Apache Spark.

Registering event types (lines 1–2); submit events (lines 4–5); register event statements (lines 7–11) and event listeners (lines 13–14).
Events can be simple events that carry slivers of meaning in themselves, and complex events which summarize, represent, or denote a set of single events which combined denotes a ‘pattern of events’. An event is represented as a set of typed key-value pairs that can be easily serialized into the JSON format. The example in Listing 1 illustrates the type and an instance of an accelerometer event that we use for activity recognition.
In this example, the x, y and z values hold the acceleration values along these axes. The timestamp field represents the number of milliseconds passed since January 1, 1970 UTC.
Batch and stream computation building blocks
The Lambda Architecture is technology agnostic. For each of the various stages in the data pipeline and layers, we use Apache Kafka6
We use Apache Spark in the batch layer of SAMURAI to implement distributed algorithms that have to process large volumes of data in a scalable way. One of these algorithms is semantic reasoning. For example, we have implemented a distributed RDFS engine with other extensions that reasons on RDF triples based on the following inference rules:
The description logic behind an RDFS reasoner is not as expressive as the language supported by many state-of-practice OWL2 ontology reasoners like Pellet [42] or HermiT [16], but our Spark-based implementation is much more capable of handling large amounts of data by transparently distributing the above inference rules and RDF triples over multiple worker nodes. Listing 2 shows how some of the above rules have been implemented in Spark.

Semantic representation of rooms and activities in an apartment.
When Storm is used in the speed layer, then SAMURAI can embed Esper10
Esper enables feature extraction from low-level events (e.g. from accelerometer to steps) that would otherwise require manual map-reduce implementations. Esper usually relies on Java POJOs to represent events at compile time. However, in SAMURAI new event types can be created anytime. We therefore expose a RESTful API to dynamically register new event types at runtime. Listing 3 illustrates how to do this with curl, a command-line utility commonly found on Linux systems to transfer data from or to a server. Registering the other event types and sending events can be done in a similar way as shown in the same figure.
Below is a short overview of some of the event stream processing steps for our motivating use cases:
Our system offers RESTful APIs to register statements and listeners. A statement is a continuous query registered with an Esper engine instance that provides results to listeners as new events arrive. In order for applications or subsystems to be notified about the step events, we add a listener to this statement as shown in line 13. The example adds a REST callback to

Registering a classifier (line 1–2); upload training data (line 4); custom Esper operator for classification (line 6).
Beyond matching patterns of events and feature extraction, SAMURAI can also leverage background knowledge stored in a semantic database to increase the meaningfulness of an event. Unfortunately, there are currently no mature RDF reasoning engines available that both run on top of a distributed MapReduce-like framework with support for the SPARQL query language and spatio-temporal reasoning. Instead, it uses Parliament 2.7.9,11
The benefits for adding and integrating semantic and spatio-temporal reasoning capabilities into our SAMURAI system are manifold:
Describe the spatial characteristics of different locations in your environment (see Listing 4 and Fig. 5). Use the W3C SSN ontology to describe the sensors and their position. Translate coordinates into semantic locations (e.g. [6.0, 10.0] being in the Living Room). Semantically link locations with relevant activities (e.g. Watch TV in a Living Room).
The following (simplified) statement demonstrates the integration with Esper (see Listing 6):

Custom geo-semantic event operator location().

Visualization of the apartment.
This statement translates the x and y coordinates (e.g. obtained after signal strength triangulation) of incoming events of type LocationEvent with the custom location() Esper operator offered by SAMURAI. The operator is mapped onto a GeoSPARQL query which retrieves the semantic location (e.g. location(6,10) → ‘Living Room’). Such higher level concepts are more suitable for classification.
When the relationship between co-occurrent events cannot be established in advance, we need classification and clustering mechanisms to probabilistically infer these dependencies. SAMURAI embeds the Weka [17] machine learning library for this purpose and exposes its key features through RESTful APIs. SAMURAI allows every application to register one or more models, with each model having a particular attribute set and classifier. See lines 1 and 2 in Listing 5. This example registers a model called m01 using Naive Bayes as an incremental classifier. The attributes used for classification are described in the Attribute-Relation File Format (ARFF) and registered with the following REST API (see line 4). By specifying an appropriate statement and corresponding listener, Esper feeds events as training or test instances into the Weka model. The example in lines 6-7 illustrates how to probabilistically classify activities from the current time (in hours) and location (e.g. 8, ‘Kitchen’ → ‘HavingBreakfast’). This example demonstrates the use of Weka to learn spatio-temporal correlations. The integration with Esper is again with custom Esper operations mapping the core classification and clustering features of Weka.
Our framework also leverages Spark’s MLlib12
While SAMURAI offers extensive capabilities to process information on a large scale by leveraging state-of-practice building blocks that are exposed as RESTful services, there are some non-trivial trade-offs as SAMURAI offers multiple similar building blocks with different performance capabilities.
The Weka machine learning library offers more algorithms compared to the MLlib machine learning extension of Apache Spark, but Weka does not have the same distributed processing capabilities for clustering and classification that Spark has.
A general purpose OWL reasoner like Pellet [42] or HermiT [16] allows for sophisticated semantic reasoning albeit on a single node, whereas our less expressive semantic reasoner can run on Spark in a distributed fashion.
Complex event processing can be accomplished either leveraging the Esper component or the Spark Streaming framework. Esper offers higher levels of abstraction, whereas the latter is far more easier to scale out over multiple nodes but requires a bit more programming to achieve the same objectives.
One of the shortcomings of SAMURAI is that it does not yet offer a unified abstraction layer for these technologies with common functionalities or similar algorithms. Such an abstraction layer would allow to transparently trade one implementation over the other to address performance concerns.

Monitoring dashboard.

Event stream processing performance in the speed layer of SAMURAI on a single node.
We evaluated our previous version of SAMURAI in [34]. This earlier version also leveraged the key building blocks for complex event processing, machine learning, and spatio-temporal and semantic reasoning. However, the whole architecture was not designed around the Lambda architecture, nor did it incorporate big data processing capabilities. The focus of the evaluation in this work will be on validating the feasibility and performance of our enhanced SAMURAI context architecture, demonstrating near-linear scalability, flexible elasticity and smooth interaction capabilities with the key components of our two motivating use cases.
Experimental setup
We use an experimental setup of 15 machines, each equipped with an Intel Core 2 Duo 3.00 GHz CPU and 4 GB of memory and running a 64-bit Ubuntu 14.04 operating system. All machines are linked to a 1 Gigabit network. We use an additional 5 machines to simulate different users. We refer to the former 15 machines as the internal side of the setup, whereas the 5 machines acting as load generators being the external side of the experimental setup. Figure 6 illustrates our monitoring dashboard with SAMURAI being deployed on 4 systems (called laarne, ronse, temse and tremelo).
From the 15 machines, the front-end is deployed on an Apache Tomcat 8.0.51 application server on a dedicated master node. The semantic reasoning engine of SAMURAI (i.e. Parliament 2.7.9) is deployed on a similar Tomcat instance on two other nodes in a load balanced setup. Parliament is a fully featured RDF triple store that runs on a single machine. To share the workload, the 2 nodes serve the same knowledge graphs so that the GeoSPARQL queries can be distributed and load balanced among both. For our experiments, we do not need more than 2 of such nodes for geospatial and semantic reasoning to not cause any performance bottlenecks, but more Parliament instances can be added if needed.
The 12 other nodes are set up to run the Spark and Storm worker nodes of the batch and speed layers of SAMURAI. They collectively also host the HDFS distributed file system that is used to store all master data processed by the batch layer of SAMURAI.
Feasibility assessment and validation with motivating use cases
The objective of the motivating examples of Section 3 is activity and behavior recognition – common for both use cases – based on accelerometer event streams, spatio-temporal and semantic reasoning. For the first use case, we aim to recognize different types of human motion, whereas in the second use case we continuously analyze and classify the risk for context-aware authentication.
Again, as in the previous work, the objective of the evaluation is not to assess the effectiveness of the recognition, but the scalability of the approach for a large user base. The 5 machines running the load generator that simulates user behavior produces about 50 events per second per user on average as input for SAMURAI. The input constitutes mainly accelerometer and gyroscope events, location updates, wifi network details, and mobile device fingerprints.
Without going into details, these events are processed by batch and speed layers, the complex event processing component, the semantic reasoning engine and machine learning classification algorithms. The batch layer is used periodically offline to train the different classifiers and decision trees which are used at runtime in the speed layer.

Scaling out to multiple worker nodes.
A first experiment particularly focuses on processing the accelerometer events using the Spark Streaming speed layer on a single worker node. The accelerometer runs at about 40 Hz for a single user. This means 10000 event updates per second when simulating about 250 users. The results are shown in Fig. 7. Simulating more users would cause the accelerometer events to no longer be processed in less than a second, i.e. in a real-time streaming fashion.
For a second experiment, we processed all the events in terms of a growing number of concurrent users in a setup scaling out up to 12 worker nodes. Figure 8 illustrates the horizontal scalability.
For both experiments, we assume that the batch layer has already learned the classifiers and decision trees based on all historic data. This step was completed offline in less than 1 hour using all the worker nodes. While our experiments were not carried out on a very large number of nodes (e.g. more than 100), our feasibility assessment does show that SAMURAI exhibits near-linear horizontal scalability. We have identified some performance concerns with Parliament, the GeoSPARQL semantic storage and reasoning backends hosted on the dedicated nodes. As mentioned earlier, Parliament does not support distributed reasoning. We are addressing this concern by implementing distributed RDFS reasoning with map() and reduce() operations, but it currently does not yet offer the same semantic and spatio-temporal reasoning capabilities that Parliament offers. If such functionality would be available, then all of SAMURAI’s features would be running inside the worker node cluster of the batch and speed layers, fully supporting elastic scalability.
SAMURAI offers a variety of features that are – to the best of our knowledge – not available in any other contemporary framework. To systematically evaluate SAMURAI and compare with systems that share some functionality, there is a need for realistic and commonly accepted benchmarks as well as tool support to generate reproduceable workloads that grow in size and complexity. A side-by-side quantitative comparison of existing systems is therefore not trivial.
However, a quantitative analysis regarding the choice of distributed computing technology for the different layers in SAMURAI is feasible. We tested the batch layer with simple data clustering and outlier detection tasks that were implemented on top of the Hadoop and Spark frameworks. For this particular experiment, Spark far outperformed Hadoop for batch processing tasks with at least an order of magnitude. The main reason for this is that Hadoop saves intermediary processing results on disk whereas Spark is memory oriented. Regarding the speed layer, Apache Spark Streaming cannot achieve the same low latency properties that Storm can. This limits the capabilities of SAMURAI to those use cases that do not require end-to-end data processing latencies below 1 second. However, we believe that these drawbacks do not outweigh the main benefits of having a single Spark API for both the batch and speed layers versus having different implementations for Hadoop and Storm.
Conclusion
We presented and evaluated our redesigned version of SAMURAI – our award winning batch and streaming context architecture – that integrates and exposes well-known components for complex event processing (feature extraction, information fusion, notification), machine learning (learn co-occurrences of events and spatio-temporal correlations), and knowledge representation (linking positions with semantic locations and activities).
We redesigned SAMURAI around key concepts of the Lambda architecture – with a batch, speed and service layer – and leveraged big data enabling technologies to achieve horizontal scalability and responsive interaction with its users. Key reasons for the redesign were (1) better support for distributed batch processing on large amounts of data, especially for machine learning tasks, (2) avoid network capacity concerns by taking data locality into consideration, (3) enable dynamic scheduling of distributed tasks to handle stragglers, and (4) consider failure as the norm rather than as the exception by leveraging fault tolerance capabilities of big data processing subsystems.
Two application cases in the healthcare domain and context-aware authentication were used to validate the feasibility and performance of our redesigned SAMURAI context architecture. The experimental evaluation demonstrated near-linear scalability, though a current limitation is still the fact that the semantic and spatio-temporal reasoner is not fully distributed on top of the big data frameworks. In our current setup, we replicate multiple instances of this component in a load balanced configuration to distribute the workload, but this is less effective compared to being able to execute such tasks on multiple nodes in parallel.
In our evaluation, we used the Apache Spark framework and its Streaming and MLlib extensions to implement the batch and speed layers. There are preliminary integrations of other big data frameworks in SAMURAI, especially Apache Storm for the speed layer. However, a side-by-side performance comparison has not yet been carried out because the current implementation of the motivating use cases is tied too much to the underlying big data processing technology. For example, a new implementation of the use cases would be required to make use of the Storm streaming backend in the speed layer. As part of future work, we will explore to what extend we can unify the implementation in a similar way as our Spark prototype where reuse of application code across the batch and speed layer is much more straightforward because these layers rely on the same technology and similar APIs.
Also as future work, we will evaluate SAMURAI’s support for fault tolerance and quantify the impact of stragglers. One of the reasons we did not explore this in depth at this stage is because the big data software systems we integrated have different strategies for fault tolerance. Some rely on data replication, whereas others rely on snapshots and the lineage of the data life cycle to recompute lost data or partitions. A systematic comparison would require an adequate testing framework that can introduce realistic faults and performance bottlenecks into the distributed system, both at the network level as well as in the worker nodes.
Footnotes
Acknowledgement
This research is partially funded by the Research Fund KU Leuven.
