Deterministic Memory Abstraction and Supporting Multicore System Architecture

Farzad Farshchi, University of Kansas, USA
Prathap Kumar Valsan, Intel, USA
Renato Mancuso, Boston University, USA
Heechul Yun, University of Kansas, USA

Abstract

Poor time predictability of multicore processors has been a long-standing challenge in the real-time systems community. In this paper, we make a case that a fundamental problem preventing efficient and predictable real-time computing on multicore is the lack of a proper memory abstraction to express memory criticality, which cuts across various layers of the system: the application, OS, and hardware. We therefore propose a new holistic resource management approach driven by a new memory abstraction, which we call Deterministic Memory. The key characteristic of deterministic memory is that the platform (the OS and hardware) guarantees small and tightly bounded worst-case memory access timing. In contrast, we refer to the conventional memory abstraction as best-effort memory, for which only highly pessimistic worst-case bounds can be achieved. We propose to utilize both abstractions to achieve high time predictability without significantly sacrificing performance. We present deterministic memory-aware OS and architecture designs, including an OS-level page allocator and hardware-level cache and DRAM controller designs. We implement the proposed OS and architecture extensions on Linux and the gem5 simulator. Our evaluation results, using a set of synthetic and real-world benchmarks, demonstrate the feasibility and effectiveness of our approach.

2012 ACM Subject Classification Computer systems organization → Real-time systems

Keywords and phrases multicore processors, real-time, shared cache, DRAM controller, Linux

Digital Object Identifier 10.4230/LIPIcs.ECRTS.2018.1

Acknowledgements This research is supported by NSF CNS 1718880. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the NSF.

© Farzad Farshchi, Prathap Kumar Valsan, Renato Mancuso, and Heechul Yun; licensed under Creative Commons License CC-BY. 30th Euromicro Conference on Real-Time Systems (ECRTS 2018), Editor: Sebastian Altmeyer; Article No. 1; pp. 1:1–1:25.


1 Introduction

High-performance embedded multicore platforms are increasingly demanded in cyber-physical systems (CPS), especially in automotive and aviation applications, to cut cost and to reduce the size, weight, and power (SWaP) of the system via consolidation [31]. Consolidating multiple tasks with different criticality levels (a.k.a. mixed-criticality systems [58, 8]) on a single multicore processor is, however, extremely challenging because interference in the shared hardware resources of the memory hierarchy can significantly alter the tasks' timing characteristics. Poor time predictability of multicore platforms is a major hurdle to their adoption in many safety-critical CPS. For example, the CAST-32A position paper by the avionics certification authorities comprehensively discusses the certification challenges of multicore avionics [9]. Therefore, in the aerospace industry, it is common practice to disable all but one core [28], because extremely pessimistic worst-case execution times (WCETs) nullify any performance benefits of using multicore processors in critical applications. This phenomenon is also known as the "one-out-of-m" problem [27].

There have been significant research efforts to address the problem. Two common strategies are (1) partitioning the shared resources among tasks or cores to achieve spatial isolation and (2) applying analyzable arbitration schemes (e.g., time-division multiple access) when accessing the shared resources to achieve temporal isolation. These strategies have been studied individually (e.g., cache [25, 59, 39], DRAM banks [38, 63], memory bus [64, 40]) or in combination (e.g., [27, 51]). However, most of these efforts improve predictability at the cost of a significant sacrifice in efficiency and performance.

In this paper, we argue that the fundamental problem that prevents efficient and predictable real-time computing on multicore is the lack of a proper memory abstraction to express memory criticality, which cuts across various layers of the system: the application, OS, and hardware. Thus, our approach starts by defining a new OS-level memory abstraction, which we call Deterministic Memory. The key characteristic of deterministic memory is that the platform (the OS and hardware) guarantees small and tightly bounded worst-case memory access timing. In contrast, we refer to the conventional memory abstraction as best-effort memory, for which only highly pessimistic worst-case bounds can be achieved.

We propose a new holistic cross-layer resource management approach that leverages the deterministic and best-effort memory abstractions. In our approach, a task can allocate either type of memory block in its address space, at the page granularity, based on the desired WCET requirement for accessing those memory blocks. The OS and hardware then apply different resource management strategies depending on the memory type. Specifically, predictability-focused strategies, such as resource reservation and predictable scheduling, shall be used for deterministic memory, while average-performance and efficiency-focused strategies, such as space sharing and out-of-order scheduling, shall be used for best-effort memory. Because not all tasks are time-critical, and not all memory blocks of a time-critical task are equally important with respect to the task's WCET, our approach makes it possible to achieve high time predictability without significantly affecting performance and efficiency through the selective use of deterministic memory.
While our approach is a generic framework that can be applied to the management of any shared hardware resource, in this paper we particularly focus on the shared cache and main memory, and demonstrate the potential benefits of our approach in the context of shared cache and DRAM-related resource management. First, we describe OS extensions and an OS-level memory allocation method to support deterministic memory. We then describe a
deterministic memory-aware cache design that provides the same level of cache space isolation as conventional way-based partitioning techniques, while achieving significantly higher cache space utilization. We also describe a deterministic memory-aware DRAM controller design that extends a previously proposed real-time memory controller [55] to achieve similar predictability benefits with minimal DRAM space waste.

We implement the deterministic memory abstraction and an OS-level memory allocator (replacing Linux's buddy allocator) in a Linux 3.13 kernel, and we implement the proposed deterministic memory-aware memory hierarchy hardware extensions (in the MMU, TLB, cache, and DRAM controller) in the gem5 full-system simulator [7], modeling a high-performance (out-of-order) quad-core platform as the baseline. We evaluate the system using a set of synthetic and real-world benchmarks from the EEMBC [14], SD-VBS [57], and SPEC2006 [19] suites. We achieve the same degree of isolation as conventional way-based cache partitioning for real-time tasks while improving the cache hit rate of co-scheduled non-real-time workloads by 39% on average. In addition, we need significantly less memory space in reserved DRAM banks while achieving WCET guarantees comparable to a state-of-the-art real-time DRAM controller.

The main contributions of this work are as follows:
- We propose a new OS-level memory abstraction, which we call Deterministic Memory, that enables efficient cross-layer resource management, balancing time predictability and resource efficiency.
- We present a concrete system design, from the OS down to the entire memory hierarchy (including shared cache and DRAM controller designs), that demonstrates the potential benefits of the new memory abstraction. The key contribution of our design is its Memory Management Unit (MMU) based approach, which provides flexible, fine-grained (page-granularity) resource management across the entire memory hierarchy.
- We implement a realistic prototype system on a Linux kernel and a cycle-accurate full-system simulator.¹
- We provide extensive empirical results, using both synthetic and real-world benchmarks, that demonstrate the effectiveness of our approach.

The remainder of the paper is organized as follows. Section 2 provides background and motivation. Section 3 describes the proposed Deterministic Memory abstraction. Section 4 provides an overview of the deterministic memory-aware system design. Section 5 presents DM-aware timing analysis. Section 6 details our prototype implementation. Section 7 presents evaluation results. We review related work in Section 8 and conclude in Section 9.

¹ We provide the modified Linux kernel source, the modified gem5 simulator source, and the simulation methodology at http://github.com/CSL-KU/detmem for replication studies.

2 Background and Motivation

In this section, we describe why the standard uniform memory abstraction is a fundamental limitation for the development of efficient and predictable real-time computing infrastructures.

CPU-centric Abstractions and Resource Management. Traditionally, the CPU has been the main focus of resource management in real-time systems. This is because, in a unicore processor, only one task at a time can access the entire memory hierarchy, and CPU scheduling decisions have a predominant impact on the response time of real-time tasks. Therefore, CPU-centric abstractions such as core, task, and task priority have been the primary focus of resource management. However, in multicore platforms, which have become
mainstream over the last decade, extensive inter-core hardware resource sharing in the memory hierarchy heavily impacts task timing. Hence, CPU time management is no longer the sole dimension to explore when reasoning about the temporal behavior of a system.

Various OS and hardware-level solutions have been proposed to manage shared resources in the memory hierarchy with the goal of improving time predictability (we will provide a comprehensive review of related work in Section 8). Nonetheless, in most approaches, CPU-centric abstractions are still used to perform allocation and scheduling of shared resources in the memory hierarchy. Unfortunately, CPU-centric abstractions are often too coarse-grained to enact accurate management policies on memory hierarchy resources, such as cache lines and main memory pages. For instance, when a fraction of the cache space is reserved for a task, it cannot be used by other tasks, even if the reserving task does not fully utilize it. Likewise, when DRAM banks are reserved for a task, they cannot be utilized by other tasks, resulting in under-utilized DRAM space, even though not all of the task's memory may need to be allocated on the reserved DRAM banks.

The Uniform Memory Abstraction. Operating systems and hardware have traditionally provided a simple uniform memory abstraction that hides all the complex details of the memory hierarchy. When an application requests more memory, the OS simply maps the necessary number of physical memory pages available at the time into the application's address space, without considering (1) how the memory pages are actually mapped to the shared hardware resources in the memory hierarchy, and (2) how they will affect application performance. Likewise, the underlying hardware components treat all memory accesses from the CPU as equal, without any regard to differences in criticality and timing requirements when allocating and scheduling the requests.

We argue that this uniform memory abstraction is fundamentally inadequate for multicore systems because it prevents the OS and the memory hierarchy hardware from making informed decisions in allocating and scheduling access to shared hardware resources. As such, we believe that new memory abstractions are needed to enable both efficient and predictable real-time resource management. It is important to note that such abstractions should not expose too many architectural details about the memory hierarchy to the users, to ensure portability in spite of rapid changes in hardware architectures.

3 Deterministic Memory Abstraction

In this section, we introduce the Deterministic Memory abstraction to address the aforementioned challenges. We define deterministic memory as a special memory space for which the OS and hardware guarantee small and tightly bounded worst-case access delay. In contrast, we refer to conventional memory as best-effort memory, for which only highly pessimistic worst-case bounds can be achieved. A platform shall support both memory types, which allow applications to express their memory access timing requirements in an architecture-neutral way, while leaving the implementation details to the platform, that is, the OS and the hardware architecture. This, in turn, enables efficient and analyzable cross-layer resource management, as we discuss in the rest of this section.

Figure 1 shows the conceptual differences between the two memory types with respect to worst-case memory access delay bounds. For clarity, we divide memory access delay into two components: inherent access delay and inter-core interference delay. The inherent access delay is the minimum access timing necessary in isolation.

Figure 1 Conceptual differences of deterministic and best-effort memories. (Figure: worst-case memory delay, decomposed into inherent timing and inter-core interference, compared for best-effort and deterministic memory.)

Table 1 Differences in resource management strategies.

                        Deterministic memory      Best-effort memory
  Space allocation      Dedicated resources       Shared resources
  Request scheduling    Predictability focused    Performance focused
  WCET bounds           Tight                     Pessimistic
In this regard, deterministic memory can be slower (in principle, though not necessarily) than best-effort memory, as its main objective is predictability rather than performance, while for best-effort memory the reverse is true. The inter-core interference delay is, on the other hand, an additional delay caused by concurrently sharing hardware resources between multiple cores. This is where the two memory types differ the most. For best-effort memory, the worst-case delay bound is highly pessimistic, mainly because the inter-core interference delay can be severe. For deterministic memory, on the other hand, the worst-case delay bound is small and tight because the inter-core interference delay is minimized by the platform.

Table 1 shows the general spatial and temporal resource management strategies of the OS and hardware to achieve the differing goals of the two memory types. Here, we mainly focus on shared hardware resources, such as the shared cache, DRAM banks, memory controllers, and buses. In contrast, we do not focus on core-private hardware resources, such as private (L1) caches, as they do not generally contribute to inter-core interference.

In the deterministic memory approach, a task can map all or part of its memory from deterministic memory. For example, the entire address space of a real-time task can be allocated from deterministic memory; or only the important buffers used in a control loop of the real-time task can be allocated from deterministic memory, while temporary buffers used in the initialization phase are allocated from best-effort memory. Our key insight is that not all memory blocks of an application are equally important with respect to the application's WCET. For instance, in the applications we profiled in Section 7.1, only a small fraction of memory pages account for most memory accesses to the shared memory hierarchy (shared cache and DRAM). Based on this insight, we now provide a detailed design and implementation of deterministic memory-aware OS and architecture extensions with the goal of achieving high efficiency and predictability.
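In equation form (our notation, for illustration only; the paper's timing analysis appears in Section 5), the decomposition above reads:

\[
  d_{\mathrm{wc}} \;=\; d_{\mathrm{inherent}} \;+\; d_{\mathrm{interference}},
\]

where the platform keeps \(d_{\mathrm{interference}}\) small and tightly bounded for deterministic memory, whereas for best-effort memory \(d_{\mathrm{interference}}\) can grow with contention from the other cores, making \(d_{\mathrm{wc}}\) highly pessimistic.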


Figure 2 Logical and physical mappings of the deterministic and best-effort memory abstractions. (a) Application view (logical): a task's address space contains both deterministic memory and best-effort memory pages. (b) System-level view (physical): in the deterministic memory-aware memory hierarchy, each of the four cores (Core1-Core4) has a dedicated cache way (W1-W4) and a dedicated DRAM bank (B1-B4) for its deterministic memory, while the remaining cache way (W5) and DRAM banks (B5-B8) are shared by the best-effort memory of all cores.

4 System Design

In this section, we first provide a high-level overview of a deterministic memory-based multicore system design (Section 4.1). We then describe the small OS and hardware architecture extensions necessary to support the deterministic/best-effort memory abstractions (Section 4.2). Lastly, we describe deterministic memory-aware cache and DRAM management frameworks (Sections 4.3 and 4.4, respectively).

4.1 Overview

Figure 2a shows the virtual address space of a task using both deterministic and best-effort memory under our approach. From the point of view of the task, the two memory types differ only in their worst-case timing characteristics. Deterministic memory is realized by extending the virtual memory system at the page granularity. Whether a certain page is deterministic or best-effort memory is stored in the task's page table, and this information is propagated throughout the shared memory hierarchy, where it is used in the allocation and scheduling decisions made by the OS and the memory hierarchy hardware.

Figure 2b shows the system-level (OS and architecture) view of a multicore system supporting the deterministic and best-effort memory abstractions. In this example, each core is given one cache way and one DRAM bank, which are used to serve deterministic memory for that core. One cache way and four DRAM banks are assigned to the best-effort memory of all cores. Here, the highlighted deterministic memory-aware memory hierarchy refers to the hardware support for the deterministic memory abstraction.

It is important to note that support for deterministic memory is generally more expensive than that for best-effort memory, in the sense that it may require dedicated space, which may be wasted if under-utilized, and predictability-focused scheduling, which may not offer the highest performance. As such, to improve efficiency and performance, it is desirable to use as little deterministic memory as possible, as long as the desired worst-case timing of real-time tasks can be satisfied.
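For concreteness, the example partitioning of Figure 2b could be captured by a small configuration table like the following sketch. All names and the layout are our own illustrative assumptions; the paper does not prescribe a configuration format.

/* Hypothetical resource-partition configuration mirroring Figure 2b.
 * Each core reserves one cache way and one DRAM bank for its deterministic
 * memory; the remaining way and banks are shared by best-effort memory. */
#include <stdint.h>

#define NUM_CORES 4

struct dm_partition {
    uint32_t det_way_mask;   /* cache ways reserved for this core's DM pages */
    uint32_t det_bank_mask;  /* DRAM banks reserved for this core's DM pages */
};

static const struct dm_partition core_part[NUM_CORES] = {
    { .det_way_mask = 1u << 0, .det_bank_mask = 1u << 0 },  /* Core 1: way W1, bank B1 */
    { .det_way_mask = 1u << 1, .det_bank_mask = 1u << 1 },  /* Core 2: way W2, bank B2 */
    { .det_way_mask = 1u << 2, .det_bank_mask = 1u << 2 },  /* Core 3: way W3, bank B3 */
    { .det_way_mask = 1u << 3, .det_bank_mask = 1u << 3 },  /* Core 4: way W4, bank B4 */
};

/* Everything not reserved above serves best-effort memory:
 * one shared cache way (W5) and four shared DRAM banks (B5-B8). */
static const uint32_t be_way_mask  = 1u << 4;
static const uint32_t be_bank_mask = 0xF0u;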

Figure 3 Deterministic memory-aware memory hierarchy: Overview. (Figure: the MMU/TLB translates a virtual address (Vaddr) into a physical address (Paddr) plus the DM bit; the pair is propagated with each request through the last-level cache (LLC) and the memory controller (MC), which issues DRAM commands/addresses. On the software side, the OS sets the DM bit (1 or 0) in each page table entry.)

4.2 OS and Architecture Extensions for Deterministic Memory Abstraction Support

The deterministic memory abstraction is realized by extending the OS's virtual memory subsystem. Whether a certain page has the deterministic memory property is stored in the corresponding page table entry. Note that, in most architectures, a page table entry contains not only the virtual-to-physical address translation but also a number of auxiliary attributes, such as access permission and cacheability. The deterministic memory property can be encoded as just another attribute, which we call the DM bit, in the page table entry.² The OS is responsible for updating the DM bits in each task's page tables. The OS provides interfaces for applications to declare and update their deterministic/best-effort memory regions at the page granularity. Any number of memory regions of any size (down to a single page) within the application's address space can be declared as deterministic memory; the rest is best-effort memory by default.

In a modern processor, the processor's view of memory is determined by the Memory Management Unit (MMU), which translates a virtual address into the corresponding physical address. The translation information, along with other auxiliary information, is stored in a page table entry, which is managed by the OS. The Translation Look-aside Buffer (TLB) caches frequently accessed page table entries to accelerate translation. As discussed above, in our design, the DM bit in each page table entry indicates whether the page holds deterministic memory or best-effort memory. Thus, the TLB also stores the DM bit and passes the information down to the memory hierarchy. Figure 3 shows this information flow of deterministic memory.

Note that bus protocols (e.g., AMBA [2]) should also provide a means to pass the deterministic memory information with each request packet. In fact, many existing bus protocols already support some form of priority information as part of bus transaction messages.³ These fields are currently used to distinguish priority between bus masters (e.g., CPU vs. GPU vs. DMA controllers). A bus transaction for deterministic memory can be incorporated into these bus protocols, for example, as a special priority class.

The deterministic memory information can then be utilized in the mapping and scheduling decisions made by the respective hardware components in the memory hierarchy. In the following, we focus on caches and DRAM controllers and on how the deterministic memory information can be utilized in these important shared hardware resources.
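As an illustration of how an application might use such an interface, the following sketch marks a control-loop buffer as deterministic memory. The advice value MADV_DETMEM and the madvise-based interface are hypothetical placeholders of our own, not the authors' actual API (their implementation is described in Section 6):

/* Illustrative user-level usage of a deterministic-memory interface.
 * MADV_DETMEM is a hypothetical advice value; on a stock kernel the call
 * simply fails and the pages remain best-effort memory. */
#define _GNU_SOURCE
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_DETMEM
#define MADV_DETMEM 100   /* hypothetical: mark pages as deterministic memory */
#endif

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t buf_size = 64 * (size_t)page;

    /* Allocate a page-aligned buffer used inside the real-time control loop. */
    void *ctrl_buf = mmap(NULL, buf_size, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (ctrl_buf == MAP_FAILED)
        return 1;

    /* Ask the OS to treat these pages as deterministic memory: conceptually,
     * the kernel would set the DM bit in the corresponding page table entries
     * and back the pages with reserved cache ways/DRAM banks. */
    if (madvise(ctrl_buf, buf_size, MADV_DETMEM) != 0) {
        /* Fallback: the pages stay best-effort memory. */
    }

    /* ... control loop accessing ctrl_buf with a tight WCET requirement ... */

    munmap(ctrl_buf, buf_size);
    return 0;
}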

² In our implementation, we currently use an unused memory attribute in the page table entry of the ARM architecture; see Section 6 for details.
³ For example, the ARM AXI4 protocol includes a 4-bit QoS identifier (the AxQOS signal) [2] that supports up to 16 different priority classes for bus transactions.


Figure 4 Deterministic memory-aware cache management. (Figure: a 5-way set-associative cache with four sets. Ways 0-1 form the Core 0 partition, ways 2-3 the Core 1 partition, and way 4 a shared partition; each cache line stores a DM bit alongside its tag and line data. Deterministic lines are tagged with the owning core and DM=1, while best-effort lines (DM=0) may belong to any core. PartMask denotes the ways of the given partition and IgnMask the ways for which DM checking is ignored; the accompanying pseudocode allocates a deterministic line (DM==1) by selecting the victim as LRU(PartMask) and updating DetMask.)
