Available online at www.sciencedirect.com

Procedia Computer Science 109C (2017) 482–489
www.elsevier.com/locate/procedia

The 8th International Conference on Ambient Systems, Networks and Technologies (ANT 2017)
Aegis: Reliable Application Execution Over the Mobile Cloud

Shubhabrata Sen∗, Jörn W. Janneck
Department of Computer Science, Lund University, Lund 22100, Sweden
Abstract

With the advent of IoT and the associated variety of pervasive and context-aware applications, there is an increasing requirement to support the execution of these applications on devices with limited processing power. This is a cost-intensive process, as it usually requires the deployment of centralized computing infrastructure accessible via a cloud interface. We envision a distributed execution environment comprised of diverse computing resources as an alternative solution to this problem. This execution environment can be defined as a mobile cloud. However, the inherent unreliability of the mobile cloud requires developers to add reliability within the applications separately, making the development process tedious. In this position paper, we present Aegis - a framework that provides a reliable and unified computing platform with built-in failure detection and repair mechanisms. Aegis draws upon the actor based execution model as well as stream processing applications to provide a reliable overlay over unreliable environments. Aegis also relieves developers of the task of adding reliability mechanisms to applications separately.

© 2017 The Authors. Published by Elsevier B.V.
Peer-review under responsibility of the Conference Program Chairs.

Keywords: distributed computing; stream processing; mobile cloud; application development; position paper
1. Introduction

The advent of 'smart' devices in recent years has led to a scenario wherein the notion of computing is no longer restricted to desktops and laptops. The phenomenon that started with smartphones has now moved on to smart cars, homes, offices, public spaces and even entire cities. As the number of these smart entities rises, a whole ecosystem of supporting applications and tools involving these entities has started to take shape. The attention that the Internet of Things (IoT) paradigm is attracting from the industrial as well as the academic community is proof of the fact that these smart entities and the applications built around them will continue to dominate the technological landscape in the near future. However, application development for IoT continues to be a challenging task. Developers face the challenge of working with a wide variety of programming models, communication protocols, platforms and programming environments. Another aspect of application development for IoT is the provision of adequate computing power. As smartphones and other devices with limited processing power become a part of the IoT ecosystem, deployments of sophisticated applications with these devices require the presence of additional dedicated computing infrastructure.
∗ Corresponding author.
E-mail address: [email protected]
1877-0509 © 2017 The Authors. Published by Elsevier B.V.
Peer-review under responsibility of the Conference Program Chairs.
10.1016/j.procs.2017.05.316
This infrastructure can either be locally deployed or be a remote deployment accessible via a cloud interface. This often leads to an increase in the deployment cost. As the requirement for providing computational processing power grows, researchers are looking at techniques to provide this power with low cost overhead. One of the possible approaches looks towards utilizing the idle processing power of computing devices 1 . These can range from mobile devices like tablets to networked computing components present in places like universities and data warehouses 2 . In other words, these distributed computing nodes connected in an ad-hoc fashion can be envisioned as a mobile cloud. However, this mobile cloud is inherently unreliable due to its very nature of construction, as the computing resources can fail abruptly and arbitrarily. This complicates the application development process, as developers now need to be aware of this unreliability and account for it in their design and implementation.

In this position paper, we propose the Aegis framework that exposes an unreliable and distributed execution environment as a reliable and unified computing platform to applications. Aegis abstracts the details of the resilience provision mechanisms from the developers and makes these unreliable distributed computing setups more accessible to application developers. The overall goal for the Aegis framework can be defined as follows: the creation of a set of models, languages and tools enabling developers to use an ad-hoc collection of dynamic, heterogeneous and unreliable computing devices as a single computing machine with a unified programming model.

2. The stream processing model

The notion of reliability and resilience varies between different application domains as well as the specific requirements of applications. As part of our initial system design, we focus on executing the class of stream processing applications reliably over unreliable environments. These are data-intensive applications that require rapid processing of large volumes of streaming data and need to be resilient to failures. Stream processing applications encompass a large class of application domains including audio/video streaming, media coding, cryptography, packet processing, signal processing etc. Stream processing has also found applications in IoT and context-aware scenarios including multimedia streaming, health data monitoring, monitoring ambient conditions in a smart home etc. 3,4,5

Stream processing allows some applications to exploit a limited form of parallel processing. These applications can use multiple computational units, such as the FPUs on a GPU or field programmable gate arrays (FPGAs), without explicitly managing allocation, synchronization, or communication among those units. A stream processing application comprises a data source, a data sink and multiple computational nodes. The data source is responsible for generating a continuous stream of data. An operation is applied to each element in the stream by the computational nodes. The data sink acts as the receiver of the data stream generated as the output of the computational nodes. The connection between the computational nodes is usually pipelined. Further, we assume that the stream processing applications adhere to the Kahn Process Networks (KPN) 6 computational model, where each computational node in a stream processing application acts as a Kahn process. A key characteristic of the KPN model is that it is deterministic in nature.
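To make this structure concrete, the following is a minimal sketch in Java (the language of our later prototype) of a pipelined source-node-sink arrangement in which the computational node behaves like a Kahn process: it performs blocking reads on its input channel and non-blocking writes to its output channel. The class and method names (StreamNode, operation, etc.) are illustrative assumptions and not part of the Aegis API.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Function;

// Illustrative sketch: a computational node modelled as a Kahn process.
// It blocks when reading from its input channel and never blocks when
// writing, since the output channel is an unbounded FIFO queue.
class StreamNode<I, O> implements Runnable {
    private final BlockingQueue<I> input;
    private final BlockingQueue<O> output;
    private final Function<I, O> operation;   // the per-token computation

    StreamNode(BlockingQueue<I> input, BlockingQueue<O> output, Function<I, O> operation) {
        this.input = input;
        this.output = output;
        this.operation = operation;
    }

    @Override
    public void run() {
        try {
            while (true) {
                I token = input.take();          // blocking read
                O result = operation.apply(token);
                output.put(result);              // non-blocking for an unbounded queue
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();  // node terminates
        }
    }
}

class PipelineExample {
    public static void main(String[] args) {
        BlockingQueue<Integer> sourceToNode = new LinkedBlockingQueue<>();
        BlockingQueue<Integer> nodeToSink = new LinkedBlockingQueue<>();

        // A single computational node that squares each token.
        new Thread(new StreamNode<>(sourceToNode, nodeToSink, x -> x * x)).start();

        // Data source: emits a continuous stream of tokens.
        new Thread(() -> {
            for (int i = 0; ; i++) {
                try { sourceToNode.put(i); Thread.sleep(100); }
                catch (InterruptedException e) { return; }
            }
        }).start();

        // Data sink: consumes the processed stream.
        new Thread(() -> {
            try { while (true) System.out.println("sink received " + nodeToSink.take()); }
            catch (InterruptedException e) { }
        }).start();
    }
}
```

Because each node only blocks on reads and never refuses a write, the output of such a pipeline is determined solely by its inputs, which is the KPN property discussed next.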
This indicates that the same input/output relation always exists irrespective of the schedule chosen to evaluate the network. The Kahn processes are monotonic as well, since tokens cannot be taken back and modified once they have been processed and put on the output queue. The monotonicity property also indicates that it may be possible to compute part of the output sequence if part of the input sequence is given. Kahn processes perform blocking reads and non-blocking writes. These processes also run in an autonomous fashion and do not require the presence of a global scheduler for synchronization. We will be using the properties of the KPN model as part of our design strategy for Aegis to provide a reliable execution environment for these applications.

3. Aegis - System overview

The primary functionality of Aegis is to provide a layer of reliability to stream processing applications executing over unreliable mobile clouds and make them resilient to failures. As discussed earlier, the mobile cloud is an ad-hoc and distributed collection of computing resources. We define an application as a set of interconnected actors 7 . These actors are concurrently operating computational kernels and communicate with each other via buffered, lossless and order-preserving communication channels. Each actor executes in a series of atomic steps, wherein a single step can entail 1) the consumption of an input token, 2) the production of an output token or 3) the modification of its own internal state.
Fig. 1. Aegis system overview.
As per our discussion of the KPN model in the preceding section, we can visualize each actor as a Kahn process. Fig. 1 provides an overview of the Aegis framework and its associated system components. Since the execution environment comprises a distributed computing setup, the actors corresponding to the nodes of an application need to be deployed on these physical machines. In order to provide a further level of abstraction, we define the execution environment for an actor as a runtime. A runtime is hosted on a physical machine and acts as a wrapper for one or more actors. All the communication between an actor and the outside world is carried out through its host runtime. The runtimes are also capable of communicating with each other (an illustrative sketch of this actor-runtime abstraction follows the component list below). Aegis is composed of the following components:
• Application Initiator
• Runtime Manager
• System Coordinator
• Replacement Runtime Selector
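As a rough illustration of the actor-runtime abstraction described above, the following minimal Java sketch shows an actor that performs an atomic step (consume a token, possibly update internal state, produce a token) and a runtime that wraps one or more actors and mediates all of their communication. The interfaces and names (Actor, AegisRuntime, fire, etc.) are assumptions made for this sketch, not the actual Aegis code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative actor: a single atomic step consumes an input token,
// may update the actor's internal state, and produces an output token.
interface Actor {
    Object fire(Object inputToken);
}

// Illustrative runtime: hosts one or more actors and is the only channel
// through which they exchange tokens with the rest of the system.
class AegisRuntime {
    private final String id;
    private final List<Actor> hostedActors = new ArrayList<>();
    private final BlockingQueue<Object> inputBuffer = new LinkedBlockingQueue<>();
    private final List<AegisRuntime> destinations = new ArrayList<>();

    AegisRuntime(String id) { this.id = id; }

    String id() { return id; }
    void deploy(Actor actor) { hostedActors.add(actor); }
    void connectTo(AegisRuntime destination) { destinations.add(destination); }
    void receive(Object token) { inputBuffer.offer(token); }

    // One execution step: take a token from the input buffer, let each hosted
    // actor fire on it, and forward the results to the connected runtimes.
    void step() throws InterruptedException {
        Object token = inputBuffer.take();
        for (Actor actor : hostedActors) {
            Object result = actor.fire(token);
            for (AegisRuntime destination : destinations) {
                destination.receive(result);
            }
        }
    }
}
```

Keeping all token traffic inside the runtime is what later allows Aegis to detect failures and reroute connections without the actors themselves being aware of it.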
The association of a stream processing application with Aegis is initiated by deploying it over the mobile cloud. Once the application is deployed, Aegis monitors the application and checks for any inconsistencies arising due to the failure of the underlying computational nodes. Once a failure is detected, Aegis strives to restore the system to a steady state. All these operations are carried out without requiring any involvement from the applications themselves.

3.1. Deploying an application

The application initiator provides an interface for a developer to deploy their application on the mobile cloud. An application is described using a graph denoting the actors and the connections between them. Fig. 2 illustrates the deployment of an application on Aegis. Prior to the deployment of the application on the mobile cloud, Aegis replicates the application logic using controlled redundancy. This is a standard procedure followed in distributed computing to provide an initial level of reliability. The initiator then maps the actors and their replicas to the available runtimes in the execution environment, assisted by the runtime manager, which selects a set of destination runtimes on which to deploy the application. The actor-runtime mapping for a given application needs to satisfy the following two conditions (see the sketch below):

1. An actor and its replica cannot be assigned to the same runtime. Likewise, two actors that are replicas of the same actor cannot be assigned to the same runtime.
2. A pair of actors that have a direct communication link between them cannot be assigned to the same runtime.

We will observe the implication of these rules subsequently. After the deployment, the initiator notifies the data sources to start the data transmission process. The initiator also provides a copy of the application graph to the runtimes selected for deploying the application.
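The two mapping conditions can be checked mechanically when the initiator assigns actors to runtimes. The following is a minimal sketch, assuming a simple in-memory representation of the application graph; the names (PlacementChecker, replicaGroup, etc.) are hypothetical and only serve to illustrate the constraints.

```java
import java.util.Map;
import java.util.Set;

// Illustrative check of the two actor-runtime mapping conditions.
class PlacementChecker {

    /**
     * @param assignment   actor id -> runtime id chosen by the initiator
     * @param replicaGroup actor id -> identifier of the group of replicas it belongs to
     * @param links        actor id -> actor ids it communicates with directly
     */
    static boolean isValid(Map<String, String> assignment,
                           Map<String, String> replicaGroup,
                           Map<String, Set<String>> links) {
        for (String a : assignment.keySet()) {
            for (String b : assignment.keySet()) {
                if (a.equals(b)) continue;
                boolean sameRuntime = assignment.get(a).equals(assignment.get(b));

                // Condition 1: replicas of the same actor must not share a runtime.
                if (sameRuntime && replicaGroup.get(a).equals(replicaGroup.get(b))) {
                    return false;
                }
                // Condition 2: directly connected actors must not share a runtime.
                if (sameRuntime && links.getOrDefault(a, Set.of()).contains(b)) {
                    return false;
                }
            }
        }
        return true;
    }
}
```

The initiator would only accept a mapping for which such a check succeeds; the rationale behind the two conditions becomes apparent during failure recovery, as discussed in Section 4.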
Fig. 2. Deploying an application over Aegis.
Each runtime also contains the information about the actors deployed in it, including the identifiers of the actors as well as the locations where the actor code is located.

3.2. Actor-Runtime interaction

As a runtime serves as the execution environment for an actor, the only interaction that an actor has with the outside world is through that runtime. The main task of the actor is to perform the computational processing on the set of tokens received. Even though a runtime can host multiple actors, we choose to illustrate a scenario with a single runtime-actor pairing for the sake of clarity. Each data token contains a list of the actors that it is intended for, as well as information about its sender. For a given runtime-actor pairing, data tokens are received from multiple sources due to the replication. In Fig. 2, the actor in Runtime 5 receives tokens from Runtime 1 and Runtime 4. Since the incoming data streams are identical, only one copy of the data token constituting the stream is sent to the actor via the input buffer. The consumption of an input token and the production of an output token denotes the firing phase of an actor. This phase is initiated by an actor when it has input tokens available for consumption on its input buffers. Once the actor fires, it consumes the input token(s) and performs the relevant computational function on them. The actor then places the generated token on the corresponding output buffer. After the processed tokens have been placed on the output buffer, the runtime sends them to the appropriate destination nodes.

4. Reliable application execution with Aegis

4.1. Detecting a failure

As Aegis operates over an unreliable and distributed computing environment, the physical machines in the environment can shut down at any time or become unresponsive. Consequently, the runtimes on these machines shut down as well, causing the actors hosted in them to terminate. As discussed earlier, each actor receives redundant incoming data streams from one or more sources owing to the replication step. Consequently, the failure of a runtime will be noticeable to the actors that receive data from the actor in the failed runtime. Specifically, the actors receiving data will note that the input stream from the failed runtime lags with respect to the streams from the other replicas. When an actor detects this inconsistency, it notifies its host runtime to raise an alert corresponding to the failed runtime. As illustrated in Fig. 3, the failure of Runtime 1 is detected by Runtime 2 as well as Runtime 4. Both these runtimes receive data from Runtime 1 and notice the inconsistency due to the failed runtime. Currently, we do not rely on a distributed consensus among the affected runtimes to confirm the occurrence of the failure. Rather, we allow each runtime to independently detect and report the error. We adopt this strategy to improve the turnaround time of the system after detecting a failure. It needs to be noted here that inconsistencies between the received data streams can also occur due to problems in the underlying communication channel. However, we assume that the communication network is reliable, and any inconsistency noted is taken to be indicative of a runtime failure.
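One simple way to realize this lag-based detection, sketched below under the assumption that each runtime tracks a token counter per incoming replica stream, is to flag a sender whose stream has fallen behind the most advanced stream by more than a chosen threshold. The class name and threshold are illustrative, not part of the described prototype.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Illustrative lag-based failure detection: each runtime counts the tokens
// received on every redundant input stream and reports a sender whose
// stream lags the most advanced one by more than a threshold.
class FailureDetector {
    private final Map<String, Long> tokensReceived = new HashMap<>(); // sender runtime id -> count
    private final long lagThreshold;

    FailureDetector(long lagThreshold) { this.lagThreshold = lagThreshold; }

    void onTokenReceived(String senderRuntimeId) {
        tokensReceived.merge(senderRuntimeId, 1L, Long::sum);
    }

    // Returns the id of a runtime suspected to have failed, if any.
    Optional<String> suspectedFailure() {
        long max = tokensReceived.values().stream().mapToLong(Long::longValue).max().orElse(0);
        return tokensReceived.entrySet().stream()
                .filter(e -> max - e.getValue() > lagThreshold)
                .map(Map.Entry::getKey)
                .findFirst();
    }
}
```

A runtime would call onTokenReceived as tokens arrive and periodically poll suspectedFailure; when a suspect is returned, it notifies the system coordinator, as described in the next subsection.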
Fig. 3. Initiating system repair after failure detection.
4.2. System repair after failure detection

In order to keep the system running during the repair phase, runtimes detecting a failure temporarily suspend the process of checking their input data streams for inconsistencies. This is done to ensure that a runtime does not keep reporting the failure of the same runtime continuously. After the runtimes have detected the failure, they send a notification to the system coordinator with the identity of the failed runtime. As multiple runtimes send the failure notification corresponding to a particular runtime, the system coordinator initiates the system repair based on the first notification received and ignores the rest. The coordinator then informs the replacement runtime selector to select a replacement runtime in conjunction with the runtime manager. The selection of the new runtime needs to satisfy the two conditions for an actor-runtime pairing discussed previously. After the replacement runtime is chosen, the system coordinator informs the first runtime to report the failure about the replacement runtime. As per Fig. 3, Runtime 4 receives this notification from the coordinator.

4.3. Restoring the system to a stable state

The next step after identifying a replacement runtime is to integrate it into the existing application setup and perform the necessary state migration operations to maintain the state of the system. The integration and the state migration operation comprise the following steps:

• Deploy a copy of the actor(s) running in the failed runtime to the replacement runtime
• Reconfigure the network connections
• Maintain the consistency of the input streams of the replacement runtime with the other runtimes

Due to the use of redundancy, Aegis ensures that each actor has one or more active replicas. A new copy of the actor in the replacement runtime can be deployed with the assistance of any of these replicas. As discussed previously, we ensure that an actor and its replica (or the replicas of the same actor) are not assigned to the same runtime. The rationale is to ensure that all the copies of an actor do not perish due to the failure of a single runtime. The runtime detecting the failure notifies the runtimes sending data to it about the actor(s) in the failed runtime and instructs them to deploy a copy of the actor (if available) to the replacement runtime. Each runtime contains the code locations of the actors executing therein. The migration step simply involves fetching a new copy of the actor code and deploying it on the replacement runtime. As the actors represent self-contained execution units, this step is straightforward to realize. As illustrated in Fig. 4, Runtime 3 sends a copy of the actor deployment information to the replacement runtime.

The next step is to reconnect the replacement runtime to the correct source and destination nodes that the failed runtime was connected to. This is required to integrate it fully into the existing application setup. We again make use of the redundancy present in the system to retrieve the connection information of the appropriate source and destination nodes. This information can be retrieved from a runtime that hosts a replica of the actor(s) in the failed runtime. As a direct consequence of the redundancy, such a runtime is connected to the same source and destination nodes as the failed runtime.
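The replica-assisted integration of a replacement runtime can be pictured roughly as follows. This is a minimal sketch under the assumption that every runtime knows the code locations and connections of the actors it hosts; the names (ReplacementIntegrator, RuntimeInfo, etc.) are hypothetical and do not come from the Aegis prototype.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Illustrative integration of a replacement runtime after a failure,
// following the steps listed above: redeploy the actor(s), then reconnect
// the replacement to the sources and destinations of the failed runtime,
// using a surviving replica as the reference.
class ReplacementIntegrator {

    // Minimal view of a runtime for the purpose of this sketch.
    interface RuntimeInfo {
        Set<String> hostedActorGroups();          // replica groups of hosted actors
        String actorCodeLocation(String group);   // where the actor code can be fetched from
        Set<String> sources();                    // runtimes it receives data from
        Set<String> destinations();               // runtimes it sends data to
        void deployActor(String codeLocation);
        void connect(Set<String> sources, Set<String> destinations);
    }

    // Find a runtime hosting a replica of the failed runtime's actor(s)
    // and use it to set up the replacement runtime.
    static boolean integrate(String failedActorGroup,
                             List<RuntimeInfo> survivingRuntimes,
                             RuntimeInfo replacement) {
        Optional<RuntimeInfo> replicaHost = survivingRuntimes.stream()
                .filter(r -> r.hostedActorGroups().contains(failedActorGroup))
                .findFirst();
        if (replicaHost.isEmpty()) {
            return false;   // no surviving replica: the application cannot be repaired this way
        }
        RuntimeInfo replica = replicaHost.get();
        replacement.deployActor(replica.actorCodeLocation(failedActorGroup));
        // The replica is connected to the same sources and destinations as the
        // failed runtime, so its connection information can simply be copied.
        replacement.connect(replica.sources(), replica.destinations());
        return true;
    }
}
```

The remaining step, keeping the input streams of the replacement consistent with the rest of the system, is discussed below.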
Fig. 4. Integrating the replacement runtime into the existing setup.
As can be observed in Fig. 4, Runtime 3 fulfils this condition and the information about the source and destination nodes can be retrieved from it. Although we assume a single actor-runtime pairing in the current discussion for the sake of clarity, it is important to note that a runtime can host actors belonging to multiple applications. This fact needs to be taken into account while retrieving the information about the source and destination nodes. We need to ensure that only those nodes are selected that correspond to the actor(s) in the failed runtime. This information can be obtained from the copy of the application graph present in each runtime. After the appropriate source and destination nodes have been identified, their network information is replicated and sent to the replacement runtime. This completes the integration process, and the replacement runtime is now part of the existing application setup.

As the stream processing applications continue to execute and data keeps flowing through the system while the repair process is underway, we need to ensure that the input streams of the replacement runtime are consistent with the other runtimes. In other words, we need to migrate the state information of the failed runtime to the replacement runtime. Again, we utilize the redundancy present in the system to carry out this step. As established previously, Runtime 3 is connected to the same source and destination nodes as the failed runtime in the application setup in Fig. 4. Thus, Runtime 3 has the same internal state as the failed runtime. The migration of the state information is now a simple matter of copying the input buffers from Runtime 3 to the replacement runtime. Finally, in order to ensure that no incoming tokens are lost in transit during the repair process, a temporary link is set up between the replacement runtime and its replica, i.e. Runtime 3. This link is used by the replica to forward the tokens arriving during the repair process. This forwarding is carried out until the input streams in the replacement runtime have synchronized with the rest of the system, after which the forwarding process is terminated. After the system has reached a steady state, the runtimes responsible for the failure detection re-enable the process of checking their input data streams for inconsistencies in order to detect subsequent failures. An interesting point to note here is that although the system repair is carried out while the system is still running, this repair process has no bearing on the overall output of the application. This follows as a direct consequence of the determinism and monotonicity properties of the stream processing applications.

5. Related Work

As ubiquitous streaming services become popular on mobile devices, developers need to address the associated challenges. Context-aware video streaming services 3,8 aim to provide uninterrupted video streaming during wireless network handovers using metrics from the network as well as the mobile client. The developers of these applications need to be aware of the limitations of the client devices and the execution environment. In contrast, we strive to keep the resilience provision mechanisms of Aegis transparent to the client devices and the applications. Also, the focus of these services is primarily on optimizing the application behavior so that the resource utilization is more efficient. The problem of providing resilience to stream processing applications over a distributed environment has been studied independently as well.
There are several approaches designed to provide resilience and recover from failures; they can be classified into three main categories: precise recovery, rollback recovery and gap recovery. Gap recovery is the weakest technique among the three, as loss of information is guaranteed to happen in this case. Rollback recovery ensures that the loss of information is minimal, but there is an impact on the system other than increased latency, as compared to precise recovery techniques where there is no visible impact of a failure except increased latency. One of the earliest efforts in the development of a stream processing engine was the Aurora 9 system, which was initially designed as a centralized system but was later integrated with the Medusa system 10 to function as a distributed system. The Aurora system provides reliability by ensuring that all nodes maintain a copy of their input tokens until the corresponding destination nodes have acknowledged their reception. Although easy to realize, this scheme runs into complications when the number of destination nodes is large. An improvement over Aurora was proposed in the Borealis system 11,12 , where reliability provision is carried out using active backup nodes that maintain a copy of a processing node but are not part of the application flow. The failure of a processing node is dealt with by replacing it with one of these backups. A separate class of techniques focuses on providing availability with minimum latency in the presence of fail-stop failures of processing nodes 13,14,15 .

The reliable execution of stream processing applications in a distributed environment usually involves the use of replication and redundancy, which are used in Aegis as well. However, Aegis does not rely on the presence of backup nodes and carries out the failure recovery by generating a new replica node dynamically at runtime. Further, as compared to the fail-stop based model of failure recovery 13,14,15 , Aegis aims to keep the overall system disruptions minimal by carrying out the failure detection and repair process while the underlying system keeps running. Also, as compared to strategies requiring the active participation of the applications in the fault detection and recovery 12,13,14,15 , Aegis provides this functionality transparently to the applications. This promotes the reuse of applications across reliable and unreliable platforms. Aegis also shares some similarity with Calvin 16 , particularly in the utilization of the actor model and the design of a platform-independent model for IoT applications. However, we place an emphasis on providing reliability to applications executing over unreliable environments, whereas the focus of Calvin is to provide a generic platform for creating and executing IoT applications.

6. Future Work

In the previous sections, we have provided an overview of the Aegis framework together with an understanding of the failure detection and repair mechanisms provided by it. As part of our initial research efforts, we have implemented a basic system prototype of the Aegis framework using Java. The source code for the prototype is available for study in a Bitbucket repository 17 . This prototype contains the basic system components of the Aegis framework and supports the failure detection and repair mechanisms discussed in the paper.
We are currently working towards enhancing our system prototype to support the deployment of multiple-actor-based stream processing applications. We also plan to improve upon certain simplifying assumptions made as part of our initial development phase. One of these issues is the improvement of the process of selecting a new replacement runtime in case of a failure. Ideally, the selection of a new runtime would depend on several factors, including the current execution load on a runtime and the network reconnection cost to connect the new runtime to the existing application setup. Also, we plan to develop a distributed consensus approach to select the new runtime in addition to the centralized coordinator currently in use. This is a challenging problem, as it will require each runtime to have knowledge about its local network topography together with the properties of the runtimes in its neighborhood.

Eventually, we intend to deploy this system prototype over a testbed and carry out a set of experimental evaluations to assess the system. This testbed will comprise a set of networked computing components, with each computing node hosting one or more runtimes. As the primary functionality of Aegis is to provide a reliable execution environment for stream processing applications, we plan to simulate system failures with different failure probabilities and study the repair and recovery behavior of the system. We are particularly interested in observing the behavior of Aegis in a scenario where the failure rate overcomes the system's ability to repair itself. It is important to study the capability of Aegis to withstand successive failures, as we intend to provide continuous online optimization through Aegis. A large-scale failure comprising several simultaneously failing nodes might require either a partial or a complete restart of the system. We also plan to study the overhead involved in the system repair and recovery process, including the time needed to perform a system repair as well as the associated communication overheads.
Another area where we would like to enhance the functionality of Aegis is with regard to the processing of non-monotonic stream processing applications. Apart from the experimental evaluations, our agenda for future work also includes the development of a distributed consensus based failure repair strategy as well as devising strategies for continuous online optimization of the system.

7. Conclusion

As the availability of smart handheld devices and internet accessibility increases, the paradigm of ubiquitous and context-aware computing systems is set to play an important part in the computing world of the future. However, the limitations of battery life and processing power in these devices pose a constraint on the large-scale deployment of such systems. Further, the increasing popularity of streaming applications and the demand for context-aware streaming services create the requirement to support the efficient processing of such applications over handheld devices. In this paper, we envision a scenario where the unused computing power distributed across the internet in the form of networked computing components can be used to support the processing of streaming applications. In order to make this distributed and unreliable execution environment appear as a resilient and unified computing platform, we propose the Aegis framework, which is capable of detecting node failures in the execution environment and repairing the defects arising in the system due to the same. Since the reliability mechanisms in Aegis are provided through software as opposed to hardware, we believe that scalable resilience to failures for stream processing applications can be achieved through the Aegis framework, as it is not constrained by the processor architecture. We have completed the first stage of development of a prototype to execute distributed streaming applications reliably. As part of the future work, we intend to carry out a set of experimental evaluations to measure the system performance as well as provide support for a more generic actor-based stream processing execution model.

References

1. J. Redman, Suchflex Turns Idle Computing Power Into an Incentivized Network. URL https://news.bitcoin.com/suchflex-computing-power-network
2. H. Ba, W. Heinzelman, C. A. Janssen, J. Shi, Mobile computing - a green computing resource, in: 2013 IEEE Wireless Communications and Networking Conference (WCNC), 2013, pp. 4451–4456. doi:10.1109/WCNC.2013.6555295.
3. J. Y. Pyun, Context-aware streaming video system for vertical handover over wireless overlay network, IEEE Transactions on Consumer Electronics 54 (1) (2008) 71–79. doi:10.1109/TCE.2008.4470026.
4. H. Rahimi, A. A. N. Shirehjini, S. Shirmohammadi, Context-aware prioritized game streaming, in: IEEE International Conference on Multimedia and Expo, 2011, pp. 1–6. doi:10.1109/ICME.2011.6012220.
5. D. O. Kang, H. J. Lee, E. J. Ko, K. Kang, J. Lee, A wearable context aware system for ubiquitous healthcare, in: Engineering in Medicine and Biology Society, 2006. EMBS '06. 28th Annual International Conference of the IEEE, 2006, pp. 5192–5195. doi:10.1109/IEMBS.2006.259538.
6. K. Gilles, The semantics of a simple language for parallel programming, in: Information Processing 74 (1974) 471–475.
7. C. Hewitt, P. Bishop, R. Steiger, A universal modular actor formalism for artificial intelligence, in: Proceedings of the 3rd International Joint Conference on Artificial Intelligence, IJCAI'73, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1973, pp. 235–245. URL http://dl.acm.org/citation.cfm?id=1624775.1624804
8. D. Kim, D. Yun, K. Chung, Context-aware multimedia quality adaptation for smart streaming, in: 2014 International Conference on Information and Communication Technology Convergence (ICTC), 2014, pp. 383–388. doi:10.1109/ICTC.2014.6983162.
9. M. Cherniack, H. Balakrishnan, M. Balazinska, D. Carney, U. Cetintemel, Y. Xing, S. B. Zdonik, Scalable distributed stream processing, in: CIDR, Vol. 3, 2003, pp. 257–268.
10. U. Cetintemel, The Aurora and Medusa projects, Data Engineering 51 (3).
11. D. J. Abadi, Y. Ahmad, M. Balazinska, U. Cetintemel, M. Cherniack, J.-H. Hwang, W. Lindner, A. Maskey, A. Rasin, E. Ryvkina, et al., The design of the Borealis stream processing engine, in: CIDR, Vol. 5, 2005, pp. 277–289.
12. M. Balazinska, H. Balakrishnan, S. R. Madden, M. Stonebraker, Fault-tolerance in the Borealis distributed stream processing system, ACM Transactions on Database Systems (TODS) 33 (1) (2008) 3.
13. J. H. Hwang, M. Balazinska, A. Rasin, U. Cetintemel, M. Stonebraker, S. Zdonik, High-availability algorithms for distributed stream processing, in: 21st International Conference on Data Engineering (ICDE'05), 2005, pp. 779–790. doi:10.1109/ICDE.2005.72.
14. M. A. Shah, J. M. Hellerstein, E. Brewer, Highly available, fault-tolerant, parallel dataflows, in: Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, ACM, 2004, pp. 827–838.
15. P. A. Tucker, D. Maier, Dealing with disorder, MPDS, June.
16. P. Persson, O. Angelsmark, Calvin - merging cloud and IoT, Procedia Computer Science 52 (2015) 210–217.
17. S. Sen, The Aegis framework. URL https://bitbucket.org/shubhabrata/ft-dsp