cyberphysical-system-on-chip

CECS-TR-13-06: CYBERPHYSICAL-SYSTEM-ON-CHIP (CPSOC)

1

CyberPhysical-System-On-Chip (CPSoC): Sensor-Actuator Rich Self-Aware Computational Platform S.Sarma, N.Dutt, P.Gupta† , N. Venkatasubramanian, A. Nicolau

I. I NTRODUCTIONS YBERPHYSICAL systems (CPSs)[1] are physical and engineered systems whose operations are monitored, coordinated, controlled, and integrated by a computing, control, and communication core. This intimate coupling between the cyber and physical spheres manifests in many largescale wide-area systems-of-systems (e.g., aerospace systems, transportation vehicles, intelligent highways, smart grids) into computing-communication-control (C3) codesign resulting in improved system efficiency in the face of dynamic environmental changes. Since such large computer controlled systems interact with the physical world, they must operate dependably, safely, securely, efficiently, and in real-time. With the increasing complexity and expanded functionality of Multiprocessor Systems-on-Chip (MPSoCs), there is an urgent need to make these platforms adaptive and self-aware. We present the notion of sensor- and actuator-rich CyberPhysical Systems-on-Chip (CPSoCs) that marry traditional MPSoCs with the adaptive and self-learning capabilities of CPSs to enable under-designed and opportunistic (UNO) computing systems [2].

C

The authors are with the Department of Computer Science, University of California, Irvine, USA, and † Department of Electrical Engineering, University of California, Los Angeles, USA; e-mail: {santanus, dutt, nalini,nicolau}@ics.uci.edu; † [email protected]. This material is based on work supported by the NSF under awards CCF-1029783 (Variability Expedition) and CNS-1063596 (Cypress).

AO AN AH AC

Actuators (Act)

Applica'ons

SA

Opera'ng System

SO

Network/Bus Communica'on Architecture

SN

Hardware Architecture

SH

Device/Circuit Architecture

SC

SENSOR FUSION

Index Terms—Cyber Physical Systems, Cross-Layer Approach, Self-Aware Computing, Adaptive Computing, MPSoC, CyberPhysical-System-On-Chip (CPSoC).

AA

ACTUATOR FUSION

Abstract—Cyberphysical systems (CPSs) are physical and engineered systems whose operations are monitored, coordinated, controlled, and integrated by a computing, control, and communication core. We propose Cyberphysical-system-on-chips (CPSoC), a new class of sensor and actuator-rich multiprocessor systemson-chip (MPSoCs), that augment MPSoCs with additional onchip and cross-layer sensing and actuation capabilities to enable self-awareness within the observe-decide-act (ODA) paradigm. Unlike traditional MPSoC designs, CPSoC differs primarily on the co-design of control, communication, and computing systems that interacts with the physical environment in real-time in order to adapt system behavior so as to dynamically react to environmental changes while achieving overall design goals. We illustrate CPSoC’s potential through several examples and applications.

Sensors (Observer) Adap've Control Policies (Decide) Middleware/Firmware Supervisory Policies

Figure 1. Cross-layer virtual sensing and actuation at different layers of CPSoC.

MPSoCs have been widely adopted for embedded signal processing, multimedia computing, and application-specific designs [3]. Although they have long history of use in embedded computing, recently they emerged as a viable platform for server/desktop computing with key architectural driving factors such as scalability, multitasking, programmability, realtime performance, and low-power operations. With increasing number of cores and large on-chip resources, coupled with the move towards nanometer technologies, future MPSoCs face multiple challenges: sustained performance in the face of dynamic and unknown workload distributions; increased vulnerability to environmental and aging effects that induce errors; and thermal and power management, to name just a few. MPSoCs face exponentially increasing power density and heating, creating drastic and harsh environments (e.g., hotspots), that manifest as aging as well as wear-out phenomena (e.g., NBTI, HCI, TDDB, Electromigration etc. [2]) which further increase susceptibility to errors in time over its usage life. The variability plagued semiconductor substrate results in substantial fluctuations in critical device/circuit parameters over time [4] as well as for with-in-die and between dies variations with the immediate consequence of diminishing yield and reliability. Traditionally a worst case design that introduces large design margins (i.e., guard-bands) of the order of more than 50 % (3σ) for future technology nodes [4], is used to enhance yield and reliability. Such worst case designs lack the ability to extract untapped performance and energy potential by relaxing variation-induced guard-bands [2]. Given the statistical and ill-understood models of several of these variability/reliability mechanisms, on-chip sensing may be the


2

App 2

App 1

Observe Reflective Middleware Layer

Timer & RTC

App N

Decide

Decisions & Learning (Controller)

Cross-Layer Sensors (Virtual & Physical)

On-chip Actuation Unit

DDRO(s)

PLL

Oxide Sensor(s) Temperature Sensor(s)

Performance Counters

Act

Leakage Sensor(s)

NIA

CPSCore

Application Layer

On-Chip Sensing & Actuation (OCSA)

CPU(s)

Aging Sensor(s)

Scratch pad/ On-Chip SRAM

$D

$I

Reliability Sensor(s)

GPIO $L2

Actuation (software and hardware) CPU

CPU

CPU

$D

$I

Traditional Operating System Scheduling

Memory Manager

File System

NIA

OCSA

NoC Router

CPU

OCSA

NoC Router OCSA

CPU

CPU

$D

$I

CPS Core Adaptive Router

CPU

CPU

NoC Router

NIA

OCSA

OCSA

NoC Router

NoC Router

OCSA

$D $L2

$L2

NIA

CPU

$I

$D

$I

$L2

OCSA

$D $L2

OCSA

NIA

CPU

$I NIA

OCSA

NoC Router

Device Drivers

CPU

$D $L2

OCSA

Hypervisor

CPU

$I

$L2

NIA

OCSA

OCSA

Chip Hardware CPU

CPU

CPU

OCSA

CPU

$D

NIA

OCSA

NoC Router OCSA

CPU $D

$I

$L2

$L2

NoC Router

CPU

$I

$D

$I NIA

$L2

NIA

OCSA

OCSA

NoC Router OCSA

Figure 2. CPSoC architecture with adaptive Core, NoC, and the ODA Loop as Middleware.

only reliable way to deal with them. As a result, we expect future MPSoCs to have a large number of sensors due to (1) more devices on the MPSoC, (2) more phenomena to sense. In this paper, we propose a novel architecture and design paradigm called CPSoC that consist of a combination of sensor-actuator-rich, self-aware computation platform and adaptive network-on-chip (NoC) along with an adaptive & reflective middleware (a flexible hardware-software stack and interface between the application and OS layer) to control the manifestations of computations (e.g., aging, overheating, parameter variability etc.) on the physical characteristics of the chip itself and the outside interacting environment. With crosslayer sensor enabled self-awareness and the ability to react to present action, predict future actions, and evaluate past actions, these CPSoCs will be capable of adapting their behavior and resources to automatically find the best way to accomplish a given goal despite changing environmental conditions and demands. The rest of the this paper is organized as follows. Section II reviews related work and positions CPSoC. Section III describes the architecture of CPSoC, while Section IV outlines the attributes and features of CPSoC. Section V presents sample runtime / adaptation policies to demonstrate how CPSoC can improve performance and extend chip lifetime. Section VI analyzes the architectural overhead of CPSoC’s constituent components. Section VII and VIIIdescribe CPSoC application use cases and the experimental framework being developed. Finally Section IX conclude this paper.

II. R ELATED W ORK AND M OTIVATION Emerging many-core computing architectures [5] appear in two classes: homogenous [6], [7], [8], [9] and heterogeneous configurations [10], [11]. Homogenous tiled manycore architectures with shared/distributed memories connected through a Network-on-Chip (NoC) fabric [12], [13] are emerging as general-purpose platforms (e.g., Tilera [8] and Intel Single Chip Computer [12], [6]). On the other hand, highperformance embedded computing platforms typically deploy heterogeneous combinations of multi/many-core architectures [10], with tens to thousands of small and big cores [14], [7], to deliver unprecedented performance within an affordable power envelope. These advances are driven by Moores Law, demanding radical changes in both the architecture and the entire software stack [15]. Traditionally, MPSoC architectures [3] have been driven by features that improve performance, but constrained by power and thermal budgets[16], [3], [17]. Power-aware [18], [19], [20], [21], thermal-aware [22], [23], [24], [25], [26], and reliability-aware [27], [28] microarchitecture had been proposed over the last decade. However, a computing framework that addresses and assures the dependability of the information processing (i.e., the cyber aspects such as integrity, correctness, accuracy, timing, reliability and security) while simultaneously addressing the physical manifestations (in performance, power, thermal, aging, wear out, material degradation, and reliability and dependability) of the information processing on the underlying computing


platform, specifically SoC, has been missing. CyberPhysicalSystem-on-Chip (CPSoC) aims to coalesce these two traditionally disjoint aspects/abstractions of cyber/information world and the underlying physical computing worlds into a unified abstraction of computing. This helps to deal with the increased vulnerability due to semiconductor process variabilities as well as environmental and aging effects that induce errors. Cyber-Physical systems (CPS) [29], whose operations are monitored, coordinated, controlled, and integrated by a computing, control, and communication paradigm, can be designed [30] to provide the assured, guaranteed, as well as graceful degradation in quality of service [31], [32]. These systems have traditionally been distributed over a large geographically spread. CPSoC derives its inspiration and brings the parallels to large scale MPSoCs connected by on-chip networks [13], [17] in nano scale dimensions. Equipped with advances in machine learning [33], [34], distributed control system design [35], [36], [37], sensor networks [38], [39], sensor fusion [40], [41], [42], [43], [44], which had often found wide applications in CPS, CPSoC brings many such attributes directly on-chip for an intelligent, self-aware, autonomic computing fabric, which had been the goals of large scale autonomic systems [45], [46], [47], [48]. Specialized features such as increasing customization [7], [49], heterogeneity [11], adaptation and agility [50], synergistic cross-layer cooperation [51], and opportunistic actuator fusion are some key attributes for CPSoC computing platforms. III. A RCHITECTURE OF CPS O C The CPSoC architecture consist of a combination of sensoractuator-rich computation platform supported by adaptive NoCs (cNoC, sNoC), Introspective Sentient Units (ISU), and an adaptive & reflective middleware to manage and control both the cyber/information and physical environment and characteristics of the chip. The CPSoC architecture is broadly divided into several layers of abstractions, for example, applications, operating system, network and bus communication, hardware, and the circuit / device layers. CPSoC inherits most features of MPSoC in addition to on-chip sensing and actuation to enable the ODA paradigm. Unlike traditional MPSoC [3], each layer of the CPSoC can be made selfaware and adaptive, by a combination of software and physical sensors and actuators as shown in Fig. 1. These layer specific feedback loops are integrated into a flexible stack which can be implemented either as a firmware or middleware as shown by the dotted line in Fig. 1. CPSoC distinctly differs from a traditional MPSoC in several ways as shown in Table. I. Traditional MPSoC paradigms lack the ability to sense the system states and behaviors across layers of system stack due to lack of architectural support; they are incapable of exploiting and exposing process and workload variations due to a lack of suitable abstractions at multiple layers. Furthermore, they sacrifice usable performance and energy potentials by adopting worst case design (guard-bands), and lack support for multi-level actuation mechanisms and adaptations to aggressively meet competing and conflicting demands. Moreover, traditional MPSoCs lack self-learning

3

mechanisms that can anticipate failures and predict vulnerabilities. The CPSoC architecture overcomes these limitations as detailed in the following sections. Table I T RADITIONAL MPS O C A"ributes

Tradi.onal MPSoC

VS

CPS O C CPSoC

Objec.ve

Efficient Resource Mangement for performance and power

Design Paradigm

Computa7on-‐Communica7on Computa7on-‐Control-‐Communica7on Centric Centric

Physical Sensing & Actua.on

Nil or limited sensing ability

Sensor Rich /Many On-‐chip Sensors

Cross-‐layer Virtual Sensing & actua.on

Not Applicable

Yes (at each layer )

Control System (ODA loop)

Open-‐loop response or actua7on

Feedback based Closed-‐loop, Hierarchical, Distributed control

Opera.ng System

Tradi7onal Opera7ng System

OS with Middleware (ODA control loop)

Communica.on Architecture

Tradi7onal NoC

Adap7ve NoC ((Bandwidth & Channel)

Resource Management

Ad-‐hoc or Limited

Full Resource-‐Awareness & Adap7ve Resource Management

Scalability

User assisted parallel execu7on (mostly sta7c)

Automa7c paralleliza7on of applica7ons (dynamic)

Adap7ve System /w Mul7-‐Objec7ve : (Reliability, Yield, Performance, Power/ Energy, Thermal, Security)

The CPSoC computation fabric with on-chip sensing and actuations (OCSA) for homogenous and strongly heterogenous architecture is shown in Fig. 3. In this distributed sensing and actuation approach, each component and core is can make the decision to manage the fabric. A example core architecture with such sensing and actuation mechanism is shown in Fig. 4. Note that as the sensing and actuation mechanism in such a configuration need to be piggy backed on the traditional best effort core-to-core communication NoC, additional design considerations (e.g., real-time processing of sensors and actuators) need to be consider. On the other hand, such concern can be handed to a dedicated custom network which can be specialized to meet these requirements efficiently. One such example specialized sensing network to efficiently sense several distributed sensors types is shown in Fig. 5. The information generated by these sensor types need to be aggregated and processed. To achieve that, CPSoC’s introduces a specialized unit called introspective sentient units (ISU) to collect, monitor, and process the sensing data and make meaning about the system’s present states and context as well as future states. Each ISU is a processors based system that is interfaced to the sNoC where the virtual sensing approach is performed. The information from the ISUs are distributed to the computational cores. The placement and configuration of the ISUs can be determined based on the overheads and communication bottlenecks. A small core (e.g. Cortex M3) may be sufficient for low processing requirements. On the other hand, when aggressive processing is required to model and predict various phenomena in real-time, larger cores (e.g.. A9) can be used. As an example, see the distribution of several ISUs across the fabric as shown in Fig. 6.


4

CPU

CPU

CPU

CPU

CPU

$I

$D

$I

$D

$I

$L2

NI A

$L2

NI A

OCSA

NoC Router

NI A

OCSA

OCSA

CPS Core

NoC Router

OCSA

OCSA

Adaptive Router

CPU

CPU

CPU

CPU

CPU

$I

$D

$I

$D

$I

$L2

NI A

$D $L2

NoC Router

OCSA

CPU

OCSA

NoC Router

NI A

OCSA

OCSA

NoC Router

NoC Router

OCSA

OCSA

OCSA

CPU

CPU

CPU

CPU

CPU

$I

$D

$I

$D

$I

$L2

$L2

NI A

NI A

OCSA

$D

OCSA

NoC Router

OCSA

OCSA

CPU

$L2

NI A

OCSA

NoC Router

NoC Router

$D $L2

$L2

NI A

CPU

OCSA

(a)

(b)

Figure 3. CPSoC Computational Fabric with on-chip sensing and actuations (OCSA) for (a) homogenous titled architecture (b) strongly heterogenous architecture.

ACTUATORS

I/F

Core

ISU

Introspec@ve Sen@ent Unit (ISU)

Single-‐Thread Mode

ACCELERATORS Core

Throughput Core

Throughput Core

A

B

C

D

E

F

G

H

R

R

Memories

Throughput Core

Throughput Mode

Throughput Core

CORE

SENSORS

Core

Throughput Mode

NIA

Figure 4. CPSoC core with on-chip sensors and actuators connected to the local bus. The cores have independent L1 cache and may share L2 and L3 cache with other cores.

IV. ATTRIBUTES AND F EATURES OF CPS O C The CPSoC framework supports four key ideas: 1) physical and virtual sensing and actuation 2) self-awareness and adaptation 3) multi or cross-layer interactions and interventions 4) predictive modeling and learning. We briefly describe these below. A. Cross-Layer Virtual and Physical Sensing & Actuation The CPSoCs are sensor-actuator-rich MPSoC that include several on-chip physical sensors (such as Aging, Oxide Breakdown, Leakage, Reliability (error signatures), Temperature,

A

R

R

R

R

MEMORIES

R

R

R

R

B

C

D

E

F

G

H

Figure 5. CPSoC with the distributed sensor network (sNoC) and the introspective sentient unit (ISU). A ring network with a time division multiplexed (TDM) router connects all the distributed sensors of different types in cores, memories and accelerators.

Performance Counters, as well as Voltage, Current, and Power sensors) on the lower three layers as shown by the on-chipsensing-and-actuation block (OCSN) in Fig. 2 and tabulated in Table II. On the other hand, virtual sensing [51] is a physicalsensor-less sensing of immeasurable parameters using indirect computation. It’s a software sensor [51] that provides indirect measurement of abstract conditions, contexts, inferences or estimates by processing (e.g., combining, aggregating, or predicting) sensed data from either a set of homogeneous or heterogeneous sensors. It’s a computational technique that enhances and/or adds sensing capability, introduces sensing options, increases sensitivity, enables efficient sensor resource uses, and overcomes physical placement and cost restrictions.


A7 A7 A7 A7 A7 A7 A7 A7 A7 A7 A7 A7

A7 A7 A7 M3 A7 A7 A7 A7 A7 M3 A7 A7

A7 A7 A7 A7 A7 A7 A7 A7 A7 A7 A7 A7 A15 A7 A7 A7 A7 A7 A7 A7 A7 A7 A7 A7 A7

A7 A7 A7 M3 A7 A7 A7 A7 A7 M3 A7 A7

A7 A7 A7 A7 A7 A7 A7 A7 A7 A7 A7 A7

A7 A7 A7 A7 A7 A7 A7 A7 A7 A7 A7 A7

A7 A9 M3 A7 A7 A7 A7 M3 A9 A7

A7 A7 A7 A7 A7 A7 A7 A7 A7 A7 A7 A7 A15 A7 A7 A7 A7 A7 A7 A7 A7 A7 A7 A7 A7

A9

A9

A9

A9

A7 A7 A7 A7 A7 A7 A7 A7 A7

A9

A9

A9

A9

A7 A9 M3 A7 A7 A7 A7 M3 A9 A7

A7 A7 A7 A7 A7 A7 A7 A7 A7 A7 A7 A7

(b)

(a) A7 A7 A7 A7 M3 A7 A7 A7 A7

5

A7 A7 A7 A7 M3 A7 A7 A7 A7

A9

A9

A9

A9

x1 x4 x7

x2 x5 x8

x3 x6 x9

A9

A9

A9

A9

x1 x4 x7

x2 x5 x8

x3 x6 x9

A9

A9

A9

A9

A7 A7 A7 A7 M3 A7 A7 A7 A7

A9

A9

A15

A15

LLC

A15

A15

of abstraction. B. Self-Awareness and Adaptation Self-awareness is used to describe the ability of the CPSoC to observe its own internal behaviors as well as external systems it interacts with such that it is capable of making judicious decisions that optimize performance and other quality of service (QoS) metrics. Self-aware systems will be capable of adapting their behavior and resources to automatically find the best way to accomplish a given goal despite changing environmental conditions and demands. A self-aware system must be able to monitor its behavior to update one or more of its components (hardware architecture, operating system and running applications), to achieve its goals.

LLC

A7 A7 A7 A7 A7 A7 A7 A7 A7

A9

A9

A9

A9

A7 A7 A7 A7 M3 A7 A7 A7 A7

A9

A9

A9

A9

A15

A15

(c)

A9

A9

A15

A15

(d)

Figure 6. Platform architecture with Introspective Sentient Units (ISU) (marked in red color) (a) asymmetric architecture (b) cluster (c) heterogeneous MPSoC (d)strongly heterogenous MPSoC with accelerators.

When combined with different kinds of sensors, virtual sensing enables consensus to resolve faults, find cause of faults, and calibration errors while providing a test bed for sensor fusion.

C. Multi or Cross-Layer Interactions and Interventions Two key attributes of the self-aware CPSoC– unlike autonomous computing [45], and invasive computing [52] – are adaptation of the layers and multiple cooperative ODA loops. As an example, the unification of an adaptive computing platform (with combined DVFS, ABB, and other actuation means) along with a bandwidth adaptive NoC offers a completely different approach (extra dimensions of control) and solutions in comparison to traditional MPSoC architecture. These cooperative and hierarchical control loops –e.g., the combination of traditional control loop (dotted lower box in Fig. 7) together with virtual sensing enabled optimized loop (upper loop in Fig. 7) – effectively translate user goals or QoS into one or more multiple design objectives.

Table II V IRTUAL /P HYSICAL S ENSING AND ACTUATIONS ACROSS L AYERS

Layers

Virtual/Physical Sensors

Virtual/Physical Actuators

D. Predictive Modeling and Learning

Applica6on

Workload; Power, Energy, Execu5on Time,

Loop perfora5on Algorithmic Choice

Opera6ng System

System U5liza5on Peripheral States

Task Alloca5on, Scheduling, Migra5on, Duty Cycling

Predictive modeling and learning abilities of the system performance, environment, and external and internal disturbances provides the intelligence in the CPSoC paradigm. Use of coupling parameters (a metric that quantifies the interactions between layers) helps to develop application and cross-layer interaction models for nominal and abnormal operations. The approach applied to predict performance, power, and thermal characteristics of CPSoC architectures at runtime[53], [54], [55]. The learning abilities of CPSoC improve autonomy in managing the system and help to provide proactive resource discovery. The online learning abilities of the CPSoC in system identification and runtime adaptation is dicussed in [56], [57].

Network/Bus Bandwidth; Packet/Flit status; Communica6on Channel Status, Conges5on, Latency

Adap5ve Rou5ng Dynamic Bandwidth Alloca5on Ch. no and direc5on

Hardware Architecture

Cache misses, Miss rate; access Cache Sizing; Reconfigura5on, rate; IPC, Throughput, ILP/MLP, Resource Provision Core asymmetry Sta5c/Dynamic Redundancy

Circuit/Device

Circuit Delay, Aging, leakage DVFS, ABB, Mul5-‐gate Temperature, oxide breakdown thresholding, Clock-‐ga5ng

Similarly, we define virtual actuations (e.g., application duty cycling, algorithmic choice, checkpointing) that are software/hardware interventions that can predictively influence system design objectives like performance, power, and reliability. Virtual actuations can be combined with physical actuation mechanisms commonly adopted in modern chips (e.g., DVFS and adaptive body biasing (ABB) to control the chip performance, power, and parametric variations); the notion of actuator fusion in CPSoC represents virtual and physical actuations that are combined across different layers

V. CPS O C D ESIGN T IME AND RUNTIME A DAPTATION P OLICIES We now discuss some of the adaptation and predictive control policies to illustrate how CPSoC can adaptively improve the performance and power as well as the lifetime of the chip. Table IV outlines sample adaptive control policies corresponding to cross-layer virtual sensing and actuation scheme as shown in Fig. 1. Using real-time aging monitors, CPSoC platforms can trigger adjustments in clock speeds or operating voltages to ensure that these systems work at


6

Table III V IRTUAL /P HYSICAL S ENSING AND ACTUATIONS ACROSS L AYERS Layers Application Operating System Network/ Bus Communication Hardware Architecture Circuit/Device

Virtual/Physical Sensors Workload, Power, Energy and Execution Time, Phases System Utilization and Peripheral States Bandwidth, Packet/Flit Status and Channel Status, Congestion Cache Misses, Miss Rate, Access Rate, IPC, Throughput, MLP Circuit Delay, Aging, Leakage Temperature, Oxide Breakdown

Virtual/Physical Actuators Loop Perforation, Approximation, Algorithmic Choice, Transformations Task Allocation, Partitioning, Scheduling Migration, Duty Cycling Adaptive Routing,Dynamic BW Allocation and Ch. no and Direction Control Cache & Issue-Width Sizing, Reconfiguration Resource Provisioning, Static/Dynamic Redundancy DVFS, ABB, Voltage Frequncy Island Clock Gating, Power Gating

Optimizing Loop Application Goals, Intents, and Hints

Optimization Polices

Predictive Models

adaptation

Reference Behaviors

System Priorities and Policies

Physical Sensors NoC

Noise Set Point Goals

±

Error

Virtual & Physical Actuation

Adaptive Controller

±

Virtual Sensors

CPSoC Fabric

Application Layer

Feedback

Traditional Control Loop

Figure 7. Multiple adaptation and predictive control loops guided by the goals, policies, and hints provided by the user. The runtime system uses these goals and hind to derive the system set point and constraints in CPSoC.

peak levels, even as they grow older (unlike the traditional conservative approach that would slow down the platform from the start). The wear and tear endured by the on-platform transistors are adaptively managed using self-healing policies. As the transistor degrades due to aging effects such as NBTI, the resulting effect is a shift in the threshold voltage and consequent delay change. The circuit delay prediction as above can be used to define a failure function that flags a failure when any critical path circuit delay goes beyond a threshold value. With the ability to virtually observe conditions of each core/block, the failure of each core/block can be quickly predicted to take corrective steps and avoid such failures. By suitably controlling the temperature as well as the stress and the rest period of the aging cores, the shift in the threshold voltage and the resulting degradation is substantially reduced.

In the following we specifically describe sample policies for: • Application-semantic driven design time polices • Variability-aware dynamic task allocation and scheduling • Dynamic actuation and actuation fusion • Runtime and root-cause guided actuation policies • Temperature control and adaptation for improved resilience

•

Self-healing policies for aging resilience

A. Application-Semantic Driven Design Time Policies Using semantic and application knowledge, we can construct and apply design/compile time rules and policies that improve the resilience and self-healing properties of applications. For example, the application can be analyzed to reduce vulnerable and critical functions and exposure to vulnerable execution durations by use of compiler transformations as shown in Fig. 8. We present a semantic annotation framework to support design time goals, intents, and hints from the programmer that is transformed to system level constraints and goals at runtime as shown in Fig. 7. We show how this design time hints from the programer can be exploit to improve system performance and power [56], [58], [59]. B. Variability-Aware Dynamic Task Allocation and Scheduling In this example of variability-aware adaptive task allocation and schedule (section VI.A) we overcome the limitation of the exiting OS allocation and scheduling approaches (e.g. Linux) in efficiently dealing with the joint problem of considering heterogeneity of cores, dynamic variations in workload, and semiconductor process variability. Due to the lack of predictive abilities and awareness of architecture workload as well as


7

Table IV A DAPTIVE C ONTROL P OLICIES Sensors Used

Diagnosis

Severity

Delay Sensor and Temperature Sensor

Thermal induced short term delay

1

Virtual workload sensor Temp. Sensor BTI sensor, delay sensor, Temp. Sensor

Workload resulting in thermal and power emergency

2

NBTI induced timing errors

3

Aging/TDDB sensor, BTI and Temp. Sensor

Impending failure

4

Immediate rest for healing

BIST, Aging/TDDB BTI & Temp. Sensor

Permanent failure

5

Reconfigure avoid block/core

Figure 8. Source code annotated deisgn time policies and compiler transformations to improve system resilience.

platform variability, exiting OS allocation and scheduling schemes for MPSoC are unable to make judicious decision in allocating tasks to appropriate cores. CPSoC overcomes these limitations through the use of introspective and predictive models to improve the system performance and power efficiency. In addition, CPSoC uses simple self-healing policies (e.g., clock or power gating of cores) is used to improve the lifetime of the chip without losing performance. The CPSoC platforms have the potential to address complex requirements of diverse applications by executing workloads (or tasks) in the most appropriate core types and meet competing and conflicting objectives (e.g., performance, throughput, power, energy, etc.) [60], [61], [62], [63]. Since different workloads (e.g. CPU bound, integer-intensive, floating-point intensive, memory intensive etc.) require different resources, some key issues are to determine and select the right types and number of cores (processing elements), allocation strategies that assign the workload (or tasks) to right core type, and the type of workload that will best benefit from the given platform. The selection of number and type of cores is not straightforward for diverse workloads with time varying program phases [64] and process induced variability in performance and power. CPSoC exploits the combined variability due to process and workloads and exposes them to the OS task allocation and scheduling to opportunistically make decisions at runtime during periodic epochs. The overview of CPSoC runtime is

Adaptive Control Policies Change the frequency proportional to delay to avoid thermal emergency Reduce power and computation resilient applications Adjust frequency and voltage and migrate workloads

Selection of Actuation Mechanism DFS (frequency control) loop perforation and approximate computation + DVFS DVFS, task migrations clock gate and power gating of the block, Adaptive body biasing Reconfiguration

shown in Fig. 9 and is comprised of following steps during periodic epochs as illustrated in Fig. 10. At every epoch, first it measures the power and the performance counters of each cores. Second, using the measurements, it estimates the individual impact or contribution of each task in the performance and power of the same core type. Third, the individual contribution per task in performance and power estimates of one core type are used to predict the per-task performance and power in other other core types. These pertask performance and power characteristics are represented in a systematic way using two 2-dimensional matrices. These matrices are then used in finding the allocation for the next epoch in the fourth step. Once the allocation is found, the tasks are scheduled by the OS in the last step. This process is repeated in each epoch as shown in Fig. 10. ……….. tn … ….….…

t5

t2

t4

t3

t1

Objec7ve(s)

Tasks

Configura7ons

Variability-‐Aware Perf Matrix Performance & Task Alloca7on & Power Es7ma7on ….. Scheduling & Predic7on Power Matrix

Performance Counters & Power & Variability Sensing

A7: Small A15: Big A11: Medium

A7 A7 A7 A7 A7 A7

A7 A7 A7 A7 A7 A7

A7 A7 A7 A7 A7 A7

A11

A11

A11

A11

A11

A11

A11

A11

Alloca7on Opportunis7c Alloca7on Decision

A15

A15

LLC

Figure 9. Dynamic Task Allocation and Scheduling in Strongly heterogeneous MPSoC [65].

We assume that tasks are spawned at the beginning of execution and that allocation and mapping decisions are taken at periodic time intervals or epochs. During the epoch, the number of tasks are assumed to be fixed since the granularity of the task spawning rate can be adapted to that of the epoch. Our initial experimental results with variability and heterogeneity-aware Linux kernel show an improvement over ~20% in system efficiency (performance per Watt) in


Measurement Es,ma,on & Predic,on

TA

Alloca,on Scheduling (CFS)

Scheduling (CFS)

TS

Scheduling (CFS) TEpoch

Epoch 1

Epoch 2

Epoch 3

Figure 10. CPSoC’s Runtime Timings Epochs for Variability-Aware Dynamic Task Allocation and Scheduling [65].

comparison to the standard Linux kernel with only workload variability and ~35-50% when process variability is jointly considered for a quad-core machine [65]. C. Dynamic Actuation and Actuator Fusion Actuator fusion combines different types of actuators across different layers of the system stack into virtual actuators. By fusing actuation mechanisms from the highest levels (e.g., application and software) together with lower layers (OS, hardware, and circuit layers), we can exploit wider opportunities for controlling system QoS properties (e.g., performance, power, thermal, and reliability guarantees) [57]. Actuator fusion affords several advantages: graceful degradation of system performance in the face of actuation mechanism failures; faster response and more dynamic range of actuation; use of simpler (low-overhead) actuation mechanisms for virtual actuation; and repurposing of different actuation mechanism to overcome specific constraints and malfunctions. Actuator fusion can be used to guarantee runtime system QoS (performance, reliability, power, thermal behavior) even in a highly dynamic environment. As a specific example, consider a many core architectural stack executing heterogeneous applications/workloads (for example probabilistic/RMS and safety-critical workloads). Also consider three actuation mechanisms that are traditionally applied independently to improve performance within the power consumptions goals: loop perforations in the application layer, task allocation in OS layer, and DVFS in the hardware layer. The fusion of these three actuation mechanisms (loop perforation, task allocation, and DVFS) enables aggressive power savings that would not be possible beyond a critical voltage and frequency with simple DVFS. Once this critical limit is reached, any further reduction in voltage/ frequency results in exponential degradation in system reliability and error characteristics. Loop perforation in the actuator fusion framework enables additional power saving without jeopardizing the system reliability and dependability. This additional dimension of control, provides added flexibility in improving the range and speed of qualities of service and experience of computing platforms [57]. D. Runtime & Root-Cause-Guided Actuation Policies Run-time policies can deploy actuation decisions that exploit the virtual sensors. Using sensor fusion, different types

8

of sensors can be used to enable root-cause diagnosis. As discussed earlier with sensor fusion, different reliabilityrelated phenomena can lead to the same observable changes. Sensor fusion with different types of sensors can be used for root-cause diagnosis. The actuation policies should use the diagnosis results from the virtual sensors to enable the appropriate root-cause-guided actuation. For example, if the root cause diagnosis indicates delay changes (identifying either BTI or soft breakdown), the system should apply appropriate actuation mechanisms: for BTI-induced delay changes, both voltage scaling and frequency scaling can be applied; whereas for soft breakdown, the system should migrate the workload and avoid using the circuit with this impending failures [57]. E. Temperature Control and Adaptation for Improved Resilience in CPSoC The temperature of the chip significantly affect the reliability and failure rate of the chip [22], for example, a mere 15o C rise in the operating temperature could halve the life span of the chip [66]. We illustrate improvement in resilience in terms of overall mean-time-to-failure (MTTF) by reducing the average core temperatures along with precise temperature control of the cores. At the hardware architecture layer we adopt a heterogenous architecture of small and big cores, and select a thermal-aware floor plan that is power and energy-efficient for a relatively diverse set of applications in comparison to a homogenous architecture as shown in Fig. 11. The thermal profile comparison for both the architectures with same benchmarks is shown in Fig.12. Using CPSoC’s adaptive and reflective middleware layer, we employ a neural network based system identification and model predictive control [67] as in Fig. 7 to model the CPSoC behavior with the aim to precisely control and adapt to the system dynamics. Previous work has shown in [68] that a 1o C temperature measurement and control accuracy can translate to 2W power savings in general purpose computers while 1.5o C accuracy results in 1W saving in mobile computers. Consequently, we propose precise control schemes by using the dynamical relationship of the thermal dissipation phenomena on-chip, modeled using a equivalent RC network as in HotSpot [22], and exploit that dynamical relationship to compute states that are directly not measured by a sensor using virtual sensing technique as describe in subsection VI-E[51]. We construct observers/ Kalman Filter based virtual sensors [51] for the heterogeneous CPSoC architecture to provide the estimation and prediction of unmeasured parameters and states as shown in Fig. 7. While the inner control loop with virtual sensor (marked in blue) is more close to the classical control system, the outer loop is the on-line learning based controller adopting a neural net as in the model predictive control [69], [67]. In addition to this control strategy, we apply actuation fusion of task migration, dynamic voltage and frequency scaling (DVFS), and clock and power gating to show the benefits. The effect of adaptation on the precise control of the temperature of a core with actuator fusion (task migration and DVFS) in comparison to only the migration based approach (e.g., throttling) is shown Fig. 13. It is observed that the actuator fusion based approach can provide precise and aggressive response in achieving the goals [57].


9

(a)

(b)

(c)

(d)

Figure 12. Thermal profile with MPSoC and CPSoC for same benchmarks with lesser average temperature in CPSoC. The peak temperature of a sensor-less core is predicted by virtual sensing [51].

Figure 13. Effect of actuator fusion in adaptation and control in CPSoC (a) temperature control with task migration only (a) joint DVFS and task migration [57].

Furthermore, we address the the mean time to failure (MTTF) of the CPSoC. Consider the MTTF of metal wire interconnects in chips when subjected to electromigration [70] that follows an exponential relationship with temperature as in (1) 1 Ac Ea M T T FEM = = KEM . n .exp (1) λEM J k.T where λEM is the failure rate due to electromigration, KEM is a technology dependent constant, Ac is the cross-sectional area, J is the current density, n is the scaling factor usually (1.1 to 2), Ea is the activation energy , k is the Boltzmann constant, and T is the absolute temperature in degree Kelvin.

Since the failure of a core in CPSoC does not mean the failure of the complete chip, we compute the MTTF of individual cores and average them across the whole fabric. The improvement in overall MTTF averaged across all the cores running the same SPEC benchmarks is shown in Fig. 14. It is observed that the MTTF improves across all the benchmark from 2 - 5 years with simple runtime adaptation policies the details of which are dicussed in [71]. F. Self-healing Policies for Improving Aging Resilience With the availability of real-time aging monitors, CPSoC can trigger adjustments in clock speeds or operating voltages


10

Figure 14. Overall MTTF averaged across all the cores for SPEC benchmarks.

as shown in Fig. 15(d) (green plot). VI. CPS O C A RCHITECTURE OVERHEADS Since the CPSoC platform introduces several new hardware and architectural modules for on-chip sensing, analysis, and actuation, in this section we analyze the overheads incurred by these architectural assists. Figure 11. Floor plan example of a) 4-core MPSoC b) 20 core heterogenous architecture of CPSoC based on Alpha EV6, EV4 processors.

to ensure that these chips work at peak levels, even as they grow older, which would eliminate the need for slowing the chips down from the start. The wear and tear endured by the transistor in the chip are adaptively managed using self-healing policies. The relationship between the circuit delay, threshold voltage, supply voltage, and temperature is related by the alpha power law given by [51] : delay = τ = KD

VD T µ λ

(VD − VT H )

A. Processor Cores, CoreNoC Architecture and Overheads The core to core NoC (cNoC) is an integral part of the tiled architectures of CPSOC. We study the overhead of two cNoC configuration for a typical 16 core platform. Fig. 16 shows two 4x4 mesh and mesh with edge enabled architectures. The specifications and resource used by the network is tabulated in Table. V which is found to be less than 3% of the cores in area and power for both the networks.

(2)

where KD is technology specific constant, µ, and λ are empirical constants for technology node. As transistor degrades due to aging effects like NBTI [72], [70], the resulting effect is a shift in the threshold voltage VT H as shown in Fig. 15(c). The circuit delay prediction as above can be used to define a yield function such that the circuit delay beyond a threshed value in any critical path would result in a failure. With the ability to virtually observe conditions[51] of each core, the failure of each core can quickly be predicted to take corrective steps and avoid such failures. Fig. 15 shows the effect of temperature on circuit performance degradation and aging. By suitably controlling the temperature as well as the stress and the rest period of the aging cores, the shift in the threshold voltage and the resulting degradation is substantially reduced

Figure 16. Core-to-core communication NoC (cNoC) topology.

In order to provide adaptivity at the networking layer, we provide architecture features to adapt the cNoC bandwidth, channel allocation, channel direction, and routing tables from software as shown in Fig. 18. We introduce a software-define adaptive NoC that is capable of adjusting the crucial network parameters at runtime. This additional adaptation mechanisms introduces moderate overhead of up to ~10% w.r.t to the cNoC the details of which are investigated in [57].


11

Figure 15. Impact of temperature variations and circuit aging on circuit delay and threshold voltage. Self-healing policies can reduce the impact of aging in CPSoC for same operating temperature. Table V R ESOURCE UTILIZATION AND OVERHEADS OF C N O C

B. On-chip Sensors and Sensor Network (sNoC) An low-overhead, low-complexity dedicated approach for the collection and use of monitoring data for diagnosis, detection, debugging, and quality of service (power, performance, reliability, fault tolerance, security) is essential for resilient execution of applications on large multicore platforms. Sensor networks in emerging large scale embedded and computing systems are essential and to some extent are already exist (albeit in ad hoc manner) for reading critical system variables such as temperature in addition to the traditional network-onchip [13]. Even though recent trends suggest that on-chip monitoring is a very promising foundation for investigating internal and environmental effects, building systems cost-effectively as well as in a predictable manner is a major engineering

challenge. Specifically, on-chip sensor networks will need to consider a scalable topology, placement of network managers and sensors, sensing and network efficiency, colocation/coexistence with on-chip data networks and interacting interfaces for bandwidth, congestion, and area-power efficiency. The requirements from such a sensor network (see Figure 5) will depend on sensing and actuation latency demands, overheads compared to simple design margining schemes, the number of sensors and sensor fusion schema. These will dictate suitable topologies and signaling mechanisms. As the number of sensors are large, we consider a high fanout network such as Aggregation tree as one of the promising topology as shown in Fig. 19a and Fig. 19b. The specification and resource usage of these two network topology is tabulated in Table. VI and


12

Figure 17. Mesh and Mesh with edge enabled core-to-core NoC (cNoC) overhead. Software-controlled

Clk Select

Clk Divider

Figure 20. Aggregation Tree (AT) based sNoC overheads with respect to cortex A9 cores of the platform.

Sensing

Channel Allocated Channel Allocator

Direction Control

Switch Allocator (SA)

Phit Count

16

VC 1 VC 2

64 Buffer States

Serializer

16

VC Arbitor (VA)

Output Buffer

16 16

Routing Computation (RC)

VC 2

Phits Input

Deserializer

Input Buffer

simplified custom circuit switched architecture as shown in Fig. 21 for the sNoC. We consider two candidate topologies (ring and star) for the custom based on the cycle latency. If the cycle latency is comparable to the number of sensors, ring topology may be sufficient. On the other hand, if a low latency is desired the star network as shown in Fig. 22 is better candidate.

VC 1

Flits Input

VC 2

Select

Channel P VC 2

Crossbar

Sensors

Channel Control

1

Figure 18. A software-defined adaptive cNoC with bandwidth, channel direction, and routing policy control.

the overhead for 16 and 32 node sNoC in comparison to the platform core (16 A9 cores) is shown in Fig.20. This overhead compared 16 x A9 cores is relatively higher as the number considering the fact that the number of sensors expected in these platform is going to be higher. Consequently, we explore other network architecture that are relatively more resource efficient in the subsequent section.

(a) 16 Node Aggregation Tree.

(b) 32 Node Aggregation Tree. Figure 19. Aggregation Tree based sNoC topology.

1) Custom Circuit Switched sNoC: Many of the on-chip sensor are low bandwidth sensors and thus the sensor data can be time division multiplex over a router. However, as packet switching introduces considerable overhead, we consider a

2

N

Mux (N:1) Controller (FSM)

Figure 21. Custom TDM Router for sNoC.

We implemented a circuit switched ring topology sNoC as shown Fig. 5in for 16 and 32 channels at each node/ router. The channels are time division multiplexed (TDM) for low bandwidth (slow varying phenomena) sensors such as aging. The comparison of the area and power with respect to the packet switched aggregation tree based sNoC is tabulated in VII. It is observed that the custom circuit switched sNoC can save area and power by two orders of magnitude. We also show that that the overhead of the networks (packet switched and circuit switched) for the sNoC. For a Aggregation tree based network of 32 node, the overhead is less that 3 % w.r.t 16xA9 platform cores. This is drastically reduced (268x in area and 89x in power) when a custom circuit switched network is used. The resource utilization of the coreto-core network (cNoC) of the platform is shown in Table. V. Based on the prototype implementation, we find that the cNoC consumes less than 3% area and power for 4x4 mesh and extended edge architectures.


13

Table VI R ESOURCE U TILIZATION AND OVERHEADS OF AGGREGATION T REE BASED S N O C

Table VII C IRCUIT S WITCHED TDM S N O C A REA P OWER E STIMATION SL No

Tech Node

Channel No

Channel Width

Area

Power

1 2

32nm 32nm

16 32

8 8

2365.287712 µm2 4.75e3 µm2

33.5115 µW 68.01 µW

% Saving w.r.t respective packet switched sNoC Area Power 127.7955× 268.4200×

47.1569× 89.6809×

Figure 22. Star topology for circuit switched sNoC. The channels are time division multiplexed (TDM).

C. On-chip Sensors and Overheads Figure 23. Thermal sensors % overhead w.r.t. platform A9 cores.

Todays computing platforms are already equipped with significant number of sensors. For example, IBM Power 7 [73] processor has over 40 distributed thermal sensors, processor core and memory activity counters, off-chip current and voltage sensors. These sensors include physical sensors such as delay sensors [74], [75], [76], voltage and power sensors [77], [78], [79], temperature sensors [80], [81], [82], [83], and reliability sensors [84], [85], [86], [73]. Sensors can also be implemented at higher level of system stack such as architecture performance counters [87] and NoC traffic/congestion monitors [88]. Most existing sensors are designed solely to monitor specific phenomena. On the other hand, CPSoC expands the role of native sensors by multi-purposing sensors and reducing sensing overheads through extensive use of sensor fusion rather than the development of new sensors. The list and specification different types of on-chip sensors are tabulated in Table. VIII through Table. XVI. We show the overheads of the using different number of sensors w.r.t to the platform cores. Based on the analysis we show that the overhead of including the large number (for 1000 sensors of 5 types) is less that 7.3 % of the area and 1.6% of the power with respect to 16 A9 platform cores. The power impacts of adding over thousands of sensors are shown in Fig.27 and is found to be less than 0.3%. The thermal impact of adding the sensors are marginal as the power contribution to the overall architecture is less that 0.3% for a 16-core architecture.

Figure 24. Power sensors % overhead w.r.t to platform A9 cores.

Note that since most of the sensors considered in the CPSoC platform can be implemented using the standard CMOS process, there are no special technology/manufacturing requirements. Mixed signal design can be used to include specialized sensors, however virtualizing and fusing several sensors can avoid such specialized sensors needs for CPSoC. For instance, although the power sensors (and few types of thermal sensors) use mixed signal design requiring an OpAmps, ADC etc., we overcome these limitations of process and custom design requirements by using virtual sensing as explained in Section VI-E. We use simple ring oscillators


14

Table VIII R ECENT S MART O N -C HIP T EMPERATURE S ENSORS Sensor

Technology

Area(mm2 )

Power

Accuracy(0 C)

Range(0 C)

Resolution(0 C)

Reference

TS1 TS2 TS3 TS4 TS5 TS6

65nm 0.35µm 0.35µm 0.7µm 65 nm 45nm

0.01 0.175 0.6 4.5 0.1

150µW @1.0V [email protected] 36.7µW @3.3V 247µW @3.3V 10µW @1.2V 250-360 µW

-5.1~+3.4 -0.7~+0.9 -0.25~+0.35 ±0.1 ±0.2 ±1

0~60 0~100 0~90 -55~125 -70~125

0.139 0.16 0.0918 0.01 0.03

[89] [90] [91] [92] [93] [94]

Table IX L EAKAGE S ENSORS S PECIFICATIONS SL No

Tech Node

Area

Power

Accuracy

Range

Resolution & Rate

Reference

LS1 LS2

45nm/1.2V (SOI) 90nm/1.2V

140x140 µm2 83x73 µm2

120 uW 0.66mW @80o C

±10 % ±3%

27-100o C/ load 0-5mA

10uA/0.5us

[95] [96]

Table X O N -C HIP P OWER S ENSORS SL No

Tech Node

Area

Power

Accuracy

Range

Resolution & Rate

Reference

PS1 PS2 PS3

45nm/1.2V (SOI) 90nm/1.2V 0.13 µm

140x140 µm2 83x73 µm2 0.01mm2

120 uW 0.66mW @80o C 180 µA@1V 180µW

±10 % ±3%

27-100o C/ load 0-5mA

10uA/0.5us

0-50mA/-23-100o C

per 80ns

[95] [96] [97]

Table XI O N -C HIP NBTI AGING S ENSORS SL No

Tech Node

Area

Power

NS1

45nm/1.1V

148.0826*10−6 mm2

18.57 µW (stress mode) 30.86 µW (meas. Mode) 12 µW 12.2 µW 15 µW

NS2 NS3 NS4 NS5 NS6 NS7 NS8

−6

2

45nm 45nm 45nm 65nm /1.2 V 130nm /1.2 V 130nm/1.2V/3.3V

60*10 mm 78*10−6 mm2 62*10−6 mm2 38.04*10−6 mm2 33792*10−6 mm2 150*10−6 mm2

130nm

−6

308*10

mm

2

Accuracy

Range

Resolution & Rate

469.5 µW 500 nW (stress mode) 4.5 nW (meas. Mode)

Figure 25. Voltage sensors % overhead w.r.t to platform A9 cores.

Reference [98] [99] [100] [101] [102] [103] [104], [105] [106]

Figure 26. NBTI sensors % overhead w.r.t to platform A9 cores.

D. SensorNoC Architecture and Overheads (RO) based delay sensors as a proxy for different sensors (e.g. temperature and power) and accurately estimates their values while saving substantial sensor area, power, design complexity and cost. The virtual sensing approach described in Section VI-E uses simple temperature sensor to estimate the power of the full chip at run-time. Therefore, the CPSoC paradigm enables intelligent choice of sensors along with virtual sensing and sensor fusion for different design concerns.

Since CPSoCs are sensor-actuator rich platforms, the information generated by these sensors need to be aggregated and processed. CPSoC’s introspective sentient units (ISU) serve these functions by collecting, monitoring, and processing the sensing data and make meaning about the system’s present states and context as well as future states. Each ISU is a processors based system that is interfaced to the sNoC where the virtual sensing approach is performed. The information


15

Table XII O N -C HIP TBDI AGING S ENSORS 38.04*10-6 mm^2SL No

Tech Node

Area

TB1 TB2

65nm /1.2/2.5V 130nm/1.2 V

38040 *10−6 mm2 214x551 µm2 20x20 Array 555x225*10−6 mm2

Power

TB3

130nm

150*10−6 mm2

Accuracy

Range

Resolution & Rate

Reference