Zhang Y, Li ZP, Cao HF. System-enforced deterministic streaming for efficient pipeline parallelism. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 30(1): 57–73 Jan. 2015. DOI 10.1007/s11390-015-1504-7

System-Enforced Deterministic Streaming for Efficient Pipeline Parallelism

Yu Zhang (张 昱), Member, CCF, Zhao-Peng Li (李兆鹏), Member, CCF, and Hui-Fang Cao (曹慧芳)

School of Computer Science and Technology, University of Science and Technology of China, Hefei 230027, China

E-mail: {yuzhang, zpli}@ustc.edu.cn; [email protected]

Received July 16, 2014; revised October 14, 2014.

Abstract    Pipeline parallelism is a popular parallel programming pattern for emerging applications. However, programming pipelines directly on conventional multithreaded shared memory is difficult and error-prone. We present DStream, a C library that provides high-level abstractions of deterministic threads and streams for simply representing pipeline stage workers and their communications. The deterministic stream is established atop our proposed single-producer/multi-consumer (SPMC) virtual memory, which integrates synchronization with the virtual memory model to enforce determinism on shared memory accesses. We investigate various strategies on how to efficiently implement DStream atop the SPMC memory, so that an infinite sequence of data items can be asynchronously published (fixed) and asynchronously consumed in order among adjacent stage workers. We have successfully transformed two representative pipeline applications – ferret and dedup – using DStream, and summarize the conversion rules. An empirical evaluation shows that the converted ferret performed on par with its Pthreads and TBB counterparts in terms of running time, while the converted dedup is close to 2.56X and 7.05X faster than the Pthreads counterpart and 1.06X and 3.9X faster than the TBB counterpart on 16 and 32 CPUs, respectively.

Keywords    deterministic parallelism, pipeline parallelism, single-producer/multi-consumer, virtual memory

1 Introduction

Pipeline parallelism[1] is an important parallel programming pattern for streaming and other emerging applications. In the pattern, computation is divided into a sequence of stages. Data items in an input stream flow through stages. Not only can different stages run in parallel, but also there can be multiple workers for any given stage. Typically, these applications have been parallelized using low-level constructs of a threading library (e.g., Pthreads), which introduce pervasive nondeterminism, and lead to complexity and heisenbugs[2-4] . Programmers have to explicitly manage the communication and schedules of pipelines. They typically use circular bounded queues for communication between stages, share data through pointers or other structures among stage workers, and have to use locks or semaphores to coordinate threads. The

logical computation of pipelines becomes obscured by all these low-level coordination details, making it difficult to write, understand, and debug. Libraries with pipeline constructs such as Intel Threading Building Blocks (TBB)① can abstract away some details, but provide nondeterministic semantics on shared memory accesses. In this paper, we present DStream, a new system that aims to support the development of pipelines by providing a basic, easy-to-use programming model with a determinism guarantee. DStream is a C library centered around a small set of functions on two parallel abstractions, i.e., threads and streams, which represent stage workers and communication channels between stages. In DStream, each thread can only share data or communicate with others via deterministic streams, besides inheriting any other memory state from its parent

Regular Paper. Special Section on Computer Architecture and Systems for Big Data.

This work was supported in part by the National High Technology Research and Development 863 Program of China under Grant No. 2012AA010901, the National Natural Science Foundation of China under Grant No. 61229201, and the China Postdoctoral Science Foundation under Grant No. 2012M521250.

① http://www.threadingbuildingblocks.org/, Nov. 2014.

©2015 Springer Science + Business Media, LLC & Science Press, China


when it is started. Each stream has a single producer thread and one or more consumer threads. The deterministic producer-consumer relationship of a stream can be set up via direct parent-child interactions between threads. After setting up, the producer (or a consumer) can asynchronously produce (or consume) each item in order by simply invoking stream write/read functions. The DStream API abstracts away the details of how to synchronize item accesses between stage workers, and how/where to save, publish (denoted as fix), or remove an item. It is the responsibility of the DStream system to guarantee the determinism, asynchronism, and unlimitedness of read/write accesses. DStream internally implements threads and streams using "copy-on-write (COW) by default" processes and our proposed single-producer/multi-consumer (SPMC) virtual memory[5-6]. The former ensures the isolation between threads by default, while the latter ensures deterministic memory accesses and synchronization behaviors between threads at page granularity. To ensure deterministic accesses to items of any user-defined size atop the underlying deterministic SPMC pages, we present two strategies on when/how to publish (fix) an item atop the SPMC pages, i.e., the lazy and eager fix strategies, trading off memory utilization against access delay. To achieve good performance and high throughput, we present an extension mechanism that permits reusing a finite range of SPMC virtual addresses to buffer items of unlimited total size for maximum asynchronous communication between threads. To evaluate the expressiveness and performance of DStream, we have converted two representative pipeline applications – content-based image retrieval (ferret) and compression (dedup) – from PARSEC[7] by elaborating deterministic schedules using DStream. These two have also been converted using TBB[8].
We compare the performance of various versions of dedup and ferret, either deterministic (DStream and Dthreads[9]) or nondeterministic (Pthreads and TBB). An empirical evaluation shows that DStream is over 3X faster than Dthreads. For ferret, DStream performs on par with Pthreads and TBB in terms of runtime. For dedup, DStream in lazy fix mode is close to 2.56X and 7.05X faster than Pthreads, and 1.06X and 3.9X faster than TBB, on 16 and 32 cores, respectively. Our main contributions are as follows.

• We develop DStream, a C library based on a small set of functions, which is both expressive and convenient.

J. Comput. Sci. & Technol., Jan. 2015, Vol.30, No.1

• We show how the DStream API can be deterministic and scalable by investigating the fix and extension strategies of a deterministic stream abstraction atop the SPMC virtual memory.

• We show how nondeterministic features in legacy multithreaded programs are converted to deterministic ones in DStream by the case study on ferret and dedup.

• We demonstrate experimentally that DStream is practical and efficient enough for pipeline programming.

2 SPMC Virtual Memory

2.1 Support for Concurrent Memory Access

Several alternatives to using shared memory with mutual exclusion primitives have been proposed in the past, including deterministic concurrency support[9-11]. They all provide isolation between threads using a COW technique, in which each thread operates on its own working copy by default, but they propose different ways to handle shared memory. Determinator[10] gives its user spaces no physically shared memory, and emulates shared memory using distributed shared memory techniques via explicit parent-child interactions. It supports only hierarchical synchronization such as fork/join, making interactions between child threads and their common parent a likely scalability bottleneck. Dthreads[9] is a deterministic version of Pthreads. It uses memory-mapped files to share the heap and globals across threads, maintains shared and private mappings to the files for each thread, and enforces determinism via a deterministic scheduler that serializes shared memory accesses. Conversion[11] extends virtual memory with commit() and update(), which can significantly simplify the memory management code of Dthreads. It maintains a version list of shared pages, so that a thread can retrieve and merge any changes committed to the trunk by calling update(), and push its local changes to the trunk by calling commit(). Rather than maintaining version lists or merging disjoint modifications on shared memory copies from different threads, we proposed the SPMC virtual memory[5-6], which allows threads to deterministically share physical pages through memory mapping techniques.

2.2 Principles on SPMC Memory

The SPMC virtual memory model allows a thread (emulated using a process) to establish direct "peer-to-peer" SPMC regions between its arbitrary descendants, eliminating it as a scalability bottleneck in subsequent computation. The SPMC region is introduced by distinguishing two types of memory mappings — producer mapping and consumer mapping — which are mapped to the same shared SPMC page frames (also called physical pages), as shown in Fig.1. An SPMC implementation should enforce the following basic SPMC principles for race freedom.

• Single-Producer-Only. For any SPMC region, there must be only one producer at any moment, accordingly avoiding WAW (write-after-write) races.

• Consumed-After-Fixed. Any consumer can successfully read an SPMC page only after the producer explicitly publishes (fixes) it. Any attempt to access an unfixed SPMC page shall be blocked, accordingly avoiding RAW (read-after-write) races.

• Fixed-at-Most-Once. The producer can explicitly fix each page frame mapped to its virtual page at most once, and then loses write permission to the fixed page frame. If the producer attempts to rewrite an SPMC region with new data, it shall explicitly remap its virtual range to new frames where the new data will be written, without affecting the old frames that consumers have not yet read, accordingly avoiding WAR (write-after-read) races.


va = spmcR_alloc(size) allocates an SPMC region of the specified size starting from va in the current thread's address space, and sets the current thread as the producer of the region.

spmcR_transown(child, sva, dva, size) transfers producer mappings from the region [sva, sva+size) in the current thread to the region [dva, dva+size) in a child thread; after that, the source region is set as consumer mappings.

spmcR_copycons(child, sva, dva, size) hands out consumer mappings from the current thread to a child thread.

A page fault raised from a producer means the faulting virtual page has not been mapped to an actual page frame. The fault handler maps the virtual page to a new page frame, and makes the page writable. A read page fault raised from a consumer means the consumer attempts to read an unfixed SPMC page. The fault handler checks whether the SPMC page frame is fixed: if so, it makes the faulting virtual page readable; otherwise, it blocks the consumer and adds it to the waiting consumer list of the page frame. A write page fault raised from a consumer or an unrelated thread indicates an erroneous access, and an error is reported.

spmcR_setfix(va, size) allows the current producer to fix the specified SPMC region [va, va+size). Each virtual page in the region then becomes read-only, and each consumer waiting on the SPMC page frames is awakened.

spmcR_clear(va, size) allows the current thread to clear the SPMC region [va, va+size), i.e., removing mappings to internal SPMC pages.

S: Consumer Mapping

Fix (Producer) Read (Consumer)

Fig.1. Threads sharing SPMC regions.

2.3 SPMC Primitives

We define a set of implementation-independent SPMC primitives below for user programs to create, use, and destroy SPMC regions at page granularity. These primitives can be implemented either in a specialized OS kernel[15], or at the user level of an existing OS by using page-protection techniques[6]. The parameters va, sva, dva, and size below must be page-aligned.


2.4 SPMC Region Extension (Remapping)

To achieve "fixed-at-most-once", an extend primitive is introduced to support remapping virtual pages of the producer or consumers to the next generation of page frames. A special carrier — the extension page — is introduced and shared between the producer and consumers to record the remapping information generated by the producer's extend call, which can be obtained by a consumer's extend call, however late it may be. The producer and consumers must agree on which pages in the region are regular data pages and which are extension pages.

spmcR_extend(extva, va, size) allows the current producer or consumer to remap the SPMC region [va, va+size) via an extension page starting from extva. A producer can invoke the primitive to extend an SPMC region by remapping the virtual range to a new generation of page frames and saving the remapping information into a special extension page in the region. A consumer can also invoke the primitive to get the remapping information via the extension page, and then remap its virtual range to the next generation of page frames according to that information.

Read/Write Page Fault on an Extension Page. Neither the producer nor a consumer may access an extension page from application code (doing so causes a page fault). The page fault handler then reports an error.

3 DStream Language

3.1 Design Goals

The focus of DStream is to give a basic streaming framework to support the development of pipelines dedicated to multicores. We aim to satisfy several design goals.

• Determinism. The programming abstractions should avoid introducing data races or other nondeterministic behavior in the first place, not merely provide ways to control, detect, or reproduce races.

• Transparency. Both the synchronization and the memory management for streaming data should be handled by the runtime system without any user intervention.

• Scalability. There is no restriction on the number of consumers consuming a stream, or on the size of data items flowing through a stream.

• Extendability. The API should be extendable to meet customized needs while providing basic functionalities.

• Programmability. The syntax should be concise and easy to program, and type checking can be done at compile and run time.

3.2 DStream Programming Model

Pipelines are better understood when their internal data flow is characterized and analyzed. Two fundamental components are common to typical pipeline or streaming systems: data processing units and the data links that connect them. In DStream, we refer to these as threads and streams. A thread consumes zero or more streams of any user-defined items and similarly produces any number of streams of identical or dissimilar data types. A stream

exists whenever there is a data flow between threads. From the thread's point of view, a stream is an effectively unbounded FIFO buffer that supports deterministic data transfer from the producer to the consumers without introducing any data races. Firstly, to avoid WAW races, each stream has only one producer thread, but any number of consumer threads. Thus, each stream internally has one write port for its producer and any number of read ports for consumers. Secondly, to avoid RAW races, the DStream runtime should ensure that each consumer consumes items already fixed on a stream in production order, and any attempt to read an unfixed item is blocked until the producer fixes it. Thirdly, since any "unbounded" buffer is internally backed by limited memory, the DStream runtime should provide a mechanism for reusing the limited memory to achieve unboundedness without introducing WAR races. The basic DStream language can be understood from the example in Fig.2. The DStream runtime environment is initialized and destroyed at the beginning and the end of main(), respectively. In main(), three threads are created to execute three tasks, i.e., T1∼T3, respectively, and a stream s is created for communication between the producer t1 and the consumers t2 and t3. The producer-consumer relationship of a stream should be set up via parent-child interactions between threads, and no read/write accesses to the stream are allowed before the completion of the setup. After setting up, the producer and consumers can directly invoke stream write/read functions to publish or consume data items on the stream at their own execution paces. As shown in Fig.3, t3 is attempting to read the sixth item, but is blocked since t1 has not yet fixed it, while t2 has just read the first item. Moreover, item sizes may differ from each other, and they are user-defined.
Apart from shared streams, DStream gives each thread a complete, private virtual replica of the state from its parent when it is started. A thread's normal reads and writes, other than accesses to streams, only affect its private working state, and do not interact with other threads. As shown in Fig.2, the main thread can initialize memory pointed to by the variable rodata, any child thread can read the shared memory through the rodata copied from its parent, and it might modify its private copy.

3.3 Deterministic Threads

In DStream, each thread can only deterministically share streams with other threads, and inherit other

main() {
    ...
    dstream_init(nthreads);
    t1 = thread_alloc();
    t2 = thread_alloc();
    t3 = thread_alloc();
    int s = dstream_alloc();
    // user-defined arguments
    thrdargs a1 = {.wport = s, ...};
    thrdargs a2 = {.rport = s, ...};
    // set producer-consumer
    dstream_setcons(s, t2);
    dstream_setcons(s, t3);
    dstream_setprod(s, t1, 0);
    ... // set read-only shared data rodata
    thread_start(t1, T1, &a1);
    thread_start(t2, T2, &a2);
    thread_start(t3, T3, &a2);
    dstream_destroy();
}

void *T1(void *args) {
    thrdargs *a = (thrdargs *)args;
    while (...) {
        // get data item from rodata
        dstream_write(a->wport, item, sz);
    }
    dstream_end(a->wport);
}

void *T2(void *args) {
    thrdargs *a = (thrdargs *)args;
    while (...) {
        // get the pointer of the next item
        pitem = dstream_read(a->rport, sz);
        // process the item pointed to by pitem; may update rodata
    }
}

void *T3(void *args) {
    thrdargs *a = (thrdargs *)args;
    while (...) { ... dstream_read(a->rport, sz); ... }
}

Fig.2. Simple DStream program.

Fig.3. Threads on a simple DStream program (Fig.2) do stream accesses at their own execution paces. (Figure: producer t1 writes items on stream s, some fixed and some not yet fixed; consumer t2 reads an early fixed item while consumer t3 blocks on a not-yet-fixed item.)

memory state from its parent when it is started. There is only a private heap for each thread, and no shared heap among DStream threads. DStream offers the following basic threading functions to construct fork-join parallelism and parent-child barriers:

// allocate a child thread instance
int thread_alloc()
// fork the child and start it running (*fn)(arg)
int thread_start(int child, void *(*fn)(void *), void *arg)
// wait for the child to return, merging the control flow
void thread_join(int child)
// return to the parent thread
void thread_ret()
// wait for the child to return, then restart the child again
void thread_sync(int child)

where thread_start(c, ...) and thread_join(c) form a pair for fork-join parallelism, like spawn-sync in Cilk Plus. When thread_start(c, ...) is invoked, the DStream runtime forks a child thread c, copies the whole state except streams from the current thread to the child, and then starts the child running. When a thread invokes thread_join(c), it merely waits until the child finishes. Unlike Determinator, the parent does not copy any state back from the child, since shared data written by the child has already been transferred through streams. A parent-child barrier is a pair of synchronization calls between a parent and a child, which can be constructed by calling thread_sync() in the parent and thread_ret() in the child. As shown in Fig.4, we can delay the stream permission setting for t1 until after starting t1 but before thread_sync(t1) in main(), provided t1 does not access the stream before calling thread_ret(). Parent-child barriers provide a potential way to dynamically set up a new stream and its producer-consumer relationship. As a first attempt, this paper only discusses the use of streams established before starting any sub-threads.

main() {
    ...
    rodata = ...
    thread_start(t1, T1, &a1);
    ...
    dstream_setprod(s, t1, 0);
    thread_start(t2, T2, &a2);
    thread_start(t3, T3, &a2);
    thread_sync(t1);
    ...
}

void *T1(void *args) {
    // do anything independent of stream s
    thread_ret();
    while (...) {
        // get data item from rodata
        dstream_write(a->wport, item, sz);
    }
}

Fig.4. Using a parent-child barrier pair to delay stream permission setting.



Obviously, both fork-join and parent-child barriers provide deterministic parallel control flows between threads. Based on these threading functions, we can further define functions to manipulate thread pools.

3.4 Deterministic Streams

DStream offers the streaming functions below.

// allocate a stream
int dstream_alloc()
int dstream_allocn(size_t maxitemsize)
/* set the specified child as the producer of the stream;
   "cons" indicates whether the current thread becomes a consumer */
int dstream_setprod(int stream, int child, bool cons)
// set the specified child as a consumer of the stream
int dstream_setcons(int stream, int child)
// write an item of size bytes pointed to by buf to the stream
size_t dstream_write(int stream, void *buf, size_t size)
// return the address of the next item in the stream
void *dstream_read(int stream, size_t size)
// close the stream
int dstream_end(int stream)

Each stream can be dynamically created by a thread, and obtains a globally unique ID. Programmers only need to set up the 1:1 or 1:n producer-consumer relationship for a stream by invoking dstream_setprod() or dstream_setcons() through parent-child interactions. After setting up, the producer and consumers can directly invoke dstream_write() and dstream_read() to communicate unlimited data items at their own execution paces, unrestricted by the thread hierarchy. The stream write and read functions hide the details of synchronization and memory management from the programmer. We now describe the semantics informally as follows.

Explicitly Setting the Producer/Consumers of a Stream Before Accessing the Stream. Any DStream thread can create a stream and become the producer of the stream. The single producer or any consumer of stream s can explicitly copy the consuming permission to any of its child threads by calling dstream_setcons(). Only the producer can transfer its producing permission to a child by calling dstream_setprod(), losing the producing permission accordingly. DStream further requires that all permission transfers of a stream be done before any accesses to the stream, thus preserving the conceptual simplicity and determinism of the setup process.

Ensuring Deterministic Read/Write at User-Defined Data Item Granularity. The producer of a

stream can directly invoke dstream_write() to publish a new item on the stream, while every consumer can invoke dstream_read() to consume each item on the stream in production order. The size of an item is user-defined, and items published on a stream can have different sizes. The DStream runtime manages synchronization and memory on the stream, and ensures that a consumer cannot read any data item from the stream until the producer publishes (i.e., fixes) the item. Note that a consumer need not offer a buffer to receive an item from the stream; it reads the item on the stream via the pointer returned from the invocation of dstream_read(). Since the virtual address of an item in a stream buffer is exposed to a consumer, the DStream runtime must ensure that any attempt to write the stream buffer by consumers or other malicious programs will cause a page/segmentation fault.

Supporting Streams with Unlimited Buffer Size. Often programmers cannot estimate the length of a data stream or the maximum buffer size needed to get the best of asynchronous production and consumption, but they can easily know the maximum item size. Thus DStream merely requires programmers to offer the maximum item size for streams containing bulky items, to ensure that the internal limited buffer can always hold a maximum-size item contiguously. The DStream runtime should support the "unboundedness" feature of a stream (physical memory limitations notwithstanding) by reusing limited virtual addresses and remapping them to different physical memory (see Section 4).

Backing up a Consumed Item Which Needs to Be Read Afterward. From the consumer's point of view, a consumed item pointed to by the pointer returned from an invocation of dstream_read() may become invalid after the consumer issues several subsequent dstream_read() calls. The reason is that the virtual addresses once occupied by the item may be flushed by the DStream runtime system for reuse.
If an item being consumed will be read again in the future, it should be backed up to the consumer's private memory in time.

Efficiently Supporting Multicast Communication Patterns. Unlike classic Unix pipes, DStream allows the producer of a stream to publish its items in shared physical memory, and multiple consumers of a stream can concurrently read the same physical memory, thus efficiently supporting multicast communication patterns.


4 DStream Runtime System

4.1 Race-Free Thread Model

Conventional systems give threads direct, concurrent access to shared data, yielding data races and heisenbugs if the threads fail to synchronize properly. DStream replaces the standard concurrent access model with a workspace consistency model[12], in which data races do not arise in the first place. The model supports two kinds of sharing, i.e., inheriting memory state from the parent, and direct read-write sharing only via streams.

Inheriting State from the Parent. Like Determinator, the model gives each thread a complete, private virtual replica of the state, except streams, from its parent when it is started. A thread's normal reads and writes on the state affect only its private working state, and do not interact with other threads. Unlike Determinator, each thread's changes to its private copy of the state are not exposed to other threads. In the example in Fig.2, threads t1 and t2 read the "prior" state set by their parent in rodata, and t2 further updates the state "in-place", without any explicit copy or additional synchronization. With conventional threads, this code has a read-write race: t1 may see an arbitrary mix of "old" and "new" states in rodata. Under DStream, however, this code is correct and race-free. Each thread only reads its private working copy of rodata, which cannot be touched by other threads.

Read-Write Sharing via Streams. In DStream, besides inheriting state from the parent, a thread can only share data with others via streams. From Subsection 2.2, we know that SPMC regions support deterministic "peer-to-peer" communication between threads at page granularity. The three rules of the SPMC protocol avoid all kinds of races at page granularity, and also provide the capability to reuse a limited virtual range to transfer an unlimited data stream. Accordingly, SPMC regions can serve as the internal buffers of streams in DStream.
The DStream runtime guarantees the high-level deterministic semantics of streams by carefully using page-aligned SPMC regions. Consider the example in Fig.2 again: threads t1∼t3 read or write data items on a stream through the stream read/write functions, without any explicit copying, synchronization, or memory management on the stream buffer. With conventional threads, this code has a read-write race: t2 or t3 may see an arbitrary mix of "uncreated", "partly written", "written but not fixed", and "fixed" states of data items in the stream. Under


DStream, however, this code is correct and race-free. The underlying implementation atop the SPMC ensures that each consumer can read items only after they are fixed. The "single producer for each stream" rule ensures there are no WAW races in the code.

4.2 Emulating Streams atop the SPMC

Each stream is internally represented as an SPMC region of fixed size, either by default or user-defined. The DStream runtime system maintains metadata to record the status of active streams, such as the offsets at which the producer puts or a consumer gets the next data item in the SPMC region. Since the SPMC memory guarantees determinism at page granularity, we need to elaborate the fix and extend policies for streams to ensure determinism at user-defined item granularity and high throughput.

Fix Policy. The SPMC primitives provide a page-aligned fix mechanism. Yet at the programming level, DStream allows a consumer or producer to read or write an item of any user-defined size on a stream; thus an item might be much smaller or larger than the page size (i.e., 4 KB), and seldom exactly a multiple of the page size. It is worth thinking carefully about how to put an item in the SPMC memory, as well as when and how to fix it. We propose two optional strategies below.

1) Eager strategy: fix an item as soon as possible. In this strategy, each item is internally saved to a page-aligned address range in the internal SPMC region, and fixed immediately after it is written to the region. A logical consequence is that successive items may be put in discontiguous address ranges, requiring consumers not to do a read across two or more items put by the producer. For example, it does not allow the producer to put two 8-byte items while a consumer attempts to get them in a single dstream_read() call.

2) Lazy strategy: fix a page when it is full or is the last one of a stream. In this strategy, an item is saved to the address range immediately after the previous item, and may not be page-aligned. Thus a single page may hold parts or all of several successive items. An item cannot be completely fixed immediately after it is written, because the last page the item occupies may not be full, and needs to be fixed by some subsequent dstream_write() call.
The DStream runtime system needs to maintain the number of fixed pages for each stream, compute how many "filled but unfixed" pages there are, and then fix them. DStream also needs to provide a mechanism for fixing the last, unfilled page of a stream, which can be done in dstream_end().



The eager strategy fixes items promptly, but introduces more space overhead. The lazy strategy delays fix operations, but may benefit from better locality, and it allows consumers to do a read across several items written by the producer.

Extension Policy. By means of the SPMC extend primitive, we design an extension policy to support unlimited streams. In essence, the "extension" of a stream for the producer or a consumer is to asynchronously remap an SPMC address range in the producer's or consumer's address space to different page frames, again and again. These "extension" operations are invoked by the implementation of dstream_write() or dstream_read() when needed, hidden from the user. A dstream_read() call directly returns the address of the item being consumed, rather than copying the item to a consumer buffer. Therefore, DStream should ensure that each consumed item occupies a contiguous SPMC address range in the consumer's address space. Moreover, a stage in some application, such as FindAllAnchors() in dedup from PARSEC v2.1[7], may produce an item composed of several (often two) interdependent subitems, where the preceding subitem contains the length of the subsequent variable-size subitem and other information. These subitems are sent via several dstream_write() calls, so that a consumer can read the first one and learn the size of the variable subitem in advance. The consumer may then alternately get information from these subitems to deal with the whole item. For this purpose, DStream requires the programmer to provide the maximum item size for streams containing bulky items, so as to allocate an SPMC region of at least twice the maximum item size. The default size of an SPMC region for a stream is set to 128 KB and can easily be changed as required.
To support unlimited streams while allowing consumers to access at least two successive items alternately, we divide each internal SPMC region of a stream into two parts of nearly the same size, and reserve one page at each end of the region to serve as an extension page that extends the corresponding half of the region. Assume a stream has an 8-page SPMC region and the underlying SPMC takes the lazy page mapping policy described in [6]. Fig.5(a) shows the initial state of the region when the producer-consumer relationship is set up. All pages within the region in the producer's or consumer's address space are mapped to an anchor page○2 A for lazy page mapping. Fig.5(b) shows the state in which the producer has already fixed six data pages and is attempting to write the 7th one, while the consumer is just reading the 1st data page mapped to the 2nd virtual page. Note that anchor page A serves as a page table containing entries corresponding to the six data pages. Thus the consumer can obtain the address of the 1st data page through anchor page A mapped to its 2nd virtual page, and then map the 2nd virtual page to the 1st data page for direct reading. At the same time, since the producer's virtual pages are filled, the producer invokes the extend primitive to perform an extension. Fig.5(c) gives the result of the extension: the producer remaps the first half of its region to a fresh anchor page B, but preserves the mappings of the second half. The producer then cannot read data pages 1∼3 any more. Meanwhile, the address of page B is saved into the corresponding entry in page A, so the consumer can reach B through its mapping to A. Fig.5(d) shows the state after the producer writes the 7th data page following the extension. The rest can be deduced by analogy: when the producer attempts to write the 10th data page, the extension of the second half of the producer's VM happens.

4.3 Prototype Implementation

To generalize DStream and support more realistic applications, we retrofit it onto Linux. DStream emulates threads using Linux processes, and by default ensures isolation between threads via COW. In addition, threads can communicate with each other via streams atop the SPMC memory. DStream also gives each thread a separate sub-heap, managed by a variant of Doug Lea's malloc○3 with about 170 modified source lines of code (SLOC).

For quick prototyping, the SPMC virtual memory is entirely emulated in Linux user space via the disciplined use of conventional mmap memory to emulate the per-process page tables and physical memory needed by SPMC memory management. We implement the access permission transitions of SPMC regions via the disciplined use of mprotect system calls. Thus some read/write operations on SPMC regions from application code trigger segmentation faults, and control is then transferred to the segmentation fault handler, which emulates page faults and control transfers.

For simplicity and memory saving, an SPMC page has a corresponding 32-bit page table entry in a per-process page table, which contains three protection bits controlling the status of the SPMC page mapping, and a 28-bit offset indicating the corresponding page in the emulated physical memory of at most 2^28 4KB pages. Our current implementation requires that the page table entries for the SPMC address range extended via an extension page be saved in a single page and be distinguishable by their entry offsets within the page. Since a page can hold 1024 32-bit entries, the maximum size of an address range extensible via one extension page is 4MB − 4KB = 4092KB. Thus the current DStream can provide an SPMC region of at most 8MB (including two 4KB extension pages) for a stream, which limits the maximum supported item size. This limitation could be eliminated by exploring a more flexible extension mechanism in the underlying SPMC model in the future.

2○ "Anchor page" is just the "shadow page" described in our previous paper[6].
3○ dlmalloc v2.8.6, Aug. 2012. http://g.oswego.edu/dl/html/malloc.html, Nov. 2014.

Fig.5. Extension policy to support unlimited streams. (a) Initial state. (b) The state with six data pages fixed, while writing the 7th one. (c) Extension result. (d) The state with the 7th data page written after the extension.

5 Applications

There are three benchmarks in PARSEC that exhibit pipeline parallelism in different forms using Pthreads, and they provide a good example of the cases programmers might encounter. x264 (29328 SLOC) has the most source lines of code and is not supported by Dthreads or TBB; we therefore convert only dedup (2566 SLOC) and ferret (10769 SLOC) using DStream. This section also summarizes the conversion rules for DStream.

5.1 Parallel Features in ferret and dedup

Table 1 summarizes the parallel features used in ferret and dedup. Although ferret has a larger SLOC value than dedup, the parallel features used in ferret are a subset of those in dedup. Fig.6 shows the program architecture of dedup.

Fork-Join Parallelism. dedup has a 5-stage pipeline, while ferret has a 6-stage pipeline. Both have a serial input stage (S1) and a serial output stage (S5 or S6). The main thread in each program creates one thread for each serial stage and n threads for each parallel stage, and joins all stage threads when they finish.

Queue. Both dedup and ferret use a set of circular queues for communication between adjacent stages and for load balancing. The producer-consumer relationship of a queue might be 1:m, m:m, or m:1. Each queue is synchronized using a mutex to avoid data races and condition variables for empty/full blocking. dedup covers all the cases of queue usage in ferret. Furthermore, as shown in Fig.6, to decrease mutex contention, dedup scales the number of queues k between two adjacent stages with the number of threads n for a parallel stage. Thus stages S1 and S5 produce or consume on k queues in round-robin fashion.

Hash Table. dedup maintains a shared hash table storing unique fingerprints of data chunks. Each P3 thread runs ChunkProcess to check whether a chunk is in the hash table. If not, it puts the chunk into the table and into the queue for stage P4; if so, it puts the chunk into the queue for stage S5, bypassing stage P4. Each P4 thread runs Compress to compress a received item, put the compressed data into the hash table, and then put the item into the queue for stage S5. The condition variable empty of a hash value is initialized by a P3 thread after the hash value is created, and signaled by a P4 thread to notify the S5 thread that the data has been compressed.

Pointers in Shared Data. Both programs widely use pointers in shared data structures to indirectly share more data among threads. This reduces data copying overhead, but makes data sharing more implicit and error-prone.

Determinism and Nondeterminism. Among the above parallel features, the following are deterministic: fork-join parallelism; shared data initialized by the main thread and read by some child threads; and the order of the data items put into each queue by stage S1. The others are nondeterministic.

Table 1. Parallel Features in dedup and ferret from PARSEC v2.1

Pipeline
  dedup: S1 P2 P3 [P4] S5, might bypass P4
  ferret: S1 P2 P3 P4 P5 S6
Data flow for a stage
  dedup: P2 takes one item and produces multiple; other P stages take one and produce one
  ferret: P stages take one and produce one
Shared data
  dedup: k 1:m queues between stages S1 and P2; k m:m queues between Pi and Pi+1 (k > 1, i = 2, 3); k 2m:1 queues between P3, P4, and S5; a hash table
  ferret: a single queue for each pair of two adjacent stages
  both: shared data pointed to by pointers in the above shared data structures; shared data initialized by the main thread and read by some child threads
Competition sync.
  dedup: a mutex for each queue; a mutex for each entry in the hash table
  ferret: a mutex for each queue
Cooperation sync.
  both: two condition variables for each queue to coordinate producers and consumers, e.g., full, empty in dedup and not_full, not_empty in ferret
  dedup: a condition variable empty for each hash value to coordinate among P3, P4, S5
Scheduling policies
  ferret: m producers or m consumers share a queue for load balancing
  dedup: S1 sends items into k queues in round-robin fashion; S5 processes k queues in round-robin fashion; the hash table is checked to decide whether to bypass stage P4
Note: Si: the i-th serial stage, Pi: the i-th parallel stage.

Fig.6. Program architecture of dedup. Tasks: D - DataProcess, F - FindAllAnchor, CP - ChunkProcess, C - Compress, S - SendBlock. Queues: AQ - anchor_que, CQ - chunk_que, PQ - compress_que, SQ - send_que. Hash table: Cache.

5.2 Rules on Converting Pthreads to DStream

The first two deterministic features mentioned above can be easily converted using the fork-join parallelism and COW mechanism in DStream. Queues can be replaced with deterministic streams, and the synchronization on queues via mutexes and condition variables can be removed. How to configure streams to replace all kinds of queues, achieving both determinism and efficiency, is still the key point. We now discuss the conversion of the different kinds of queues and of the other parallel features.

Converting 1:m or m:1 Queues. In dedup and ferret, stage S1 is essentially a splitter of RoundRobin type as in StreamIt[13]: it sends items to the n threads of stage P2 via k queues or one queue in round-robin fashion. These queues can be replaced with n 1:1 streams, so that the S1 thread puts items on the n streams in round-robin fashion. The last stage is essentially a joiner of RoundRobin type in StreamIt, whose function is analogous to a RoundRobin splitter. Therefore, the m:1 queues used by the joiner can similarly be replaced with n 1:1 streams, and the items on the n streams are processed by the joiner in round-robin fashion. The round-robin schedule provides both load balancing and determinism. In addition, a splitter of Duplicate type in StreamIt can directly use a 1:n stream to multicast items to the n threads of the next stage.

Converting m:m Queues. For communication between two adjacent parallel stages (with n workers each), it is common to use an n:n queue or k m:m queues for load balancing between n producers and n consumers, leading to nondeterministic production and consumption orders. The nondeterminism could be eliminated by replacing the queue(s) with n × n streams for the n × n producer-consumer pairs and applying a round-robin schedule over n streams in each producer and consumer. Yet this complicates the processing in each consumer, and would cause a round-robin consumer to block on some stream even if the other n − 1 streams have available items. To avoid such cases, n streams are introduced to match each producer to exactly one consumer, so that a consumer directly processes the items sent by its matched producer.
Converting the Shared Hash Table. The hash table in dedup creates further interdependence between stages, whether or not they are adjacent. As shown in Fig.6, each F thread splits a received data block into several smaller chunks and sends the chunks to the next CP threads in round-robin fashion. Each CP thread computes the hash value of each processed chunk and searches cache to decide whether to insert the chunk. Each C thread compresses a received chunk and searches cache to decide where to insert the compressed data. The S thread checks cache to acquire the compressed data and to identify whether the data has been written.

DStream does not allow any nondeterministic shared data accesses, so each related DStream thread would have to maintain its own copy of the hash table for uniqueness checking; under the original round-robin scheduling, efficiently keeping these multiple copies consistent across threads is a challenge. To meet the challenge, we partition the hash table into n sub hash tables, and let each CP or C thread deal only with items whose hash values fall in the range of the same sub hash table. We then move the hash value computation from the CP stage to the F stage: each F thread decides which CP thread receives a chunk according to the chunk's hash value, and each CP thread decides whether a chunk needs to be compressed by checking and maintaining its own sub hash table. To eliminate the S thread's dependence on the hash table, each chunk item received by the S thread carries its compressed data or its fingerprint. Under the default copy-on-write semantics, the main DStream thread only needs to create a hash table of ⌈1/n⌉ of the original length, and each CP thread only needs to map indices of the original table to the corresponding indices in the ⌈1/n⌉ table.

Converting Stage Bypassing. Due to the nondeterministic recurrence rate of data chunks, it is difficult to arrange an efficient and load-balanced round-robin schedule in the joiner, i.e., the S thread, if one directly sets up n streams between CP and S and another n streams between C and S. Our solution is to cancel stage bypassing: there is no direct communication between CP and S in the converted dedup. Each CP thread sends an item, either a chunk to be compressed or a fingerprint, to its corresponding C thread; each C thread either directly forwards a fingerprint, or compresses a chunk and then sends the compressed data to the S thread.

Converting Shared Pointers. Under the above conversion policies, some data pointed to by shared pointers becomes read-only shared or private data, but the conversion of the rest is still a question. For the latter case, programmers need to adjust the data structures of items and require a stream producer to put a copy of the data, rather than a pointer to it, on the stream; consumers then share the data on the stream via the pointers returned from dstream_read() calls.

Others. In DStream, since a stage thread has different streams for putting its produced items and for getting the items to be processed, it is easy to handle the "single input, multiple output" pattern appearing in the dedup F stage, which is a challenge in TBB[8].


5.3 Results of Converting dedup and ferret

Fig.7 shows the converted deterministic schedules for dedup and ferret, named d-dedup, d-ferret, and d-ferret-p, and the SLOC comparisons between the various versions. For ferret, d-ferret has the same stages as the Pthreads version, but d-ferret-p packs the middle four stages into one bigger stage to avoid communication within those four stages. Assuming n threads run each parallel stage, the converted dedup needs to set up n² + 4n streams, and the unpacked and packed conversions of ferret need 5n and 2n streams, respectively.

6 Evaluation

We perform our evaluation on a 32-core Intel system with four Xeon E7-4820 processors and 128GB of RAM, running Ubuntu 12.04. Benchmarks were built as 64-bit executables with gcc -O3. We logically disable CPUs using Linux's CPU hotplug mechanism, which allows disabling or enabling an individual CPU by writing "0" (or "1") to a special pseudo file (/sys/devices/system/cpu/cpuN/online). The total number of threads was matched to the number of CPUs enabled, e.g., one thread runs each parallel stage when the number of CPUs is less than 8. Each workload is executed 10 times; to reduce the effect of outliers, the lowest and the highest runtimes for each workload are discarded, and each reported result is the average of the remaining eight runs.

6.1 Determinism

The determinism of DStream is ensured by the following: 1) determinism of the underlying SPMC memory at page granularity, which any SPMC implementation ensures by satisfying the race-free SPMC protocols; 2) determinism of the DStream API's semantics, which requires the programmer to explicitly set up the deterministic producer-consumer relationship of a stream before accessing it; and 3) determinism of user-defined item accesses, which is ensured by the fix and extend policies of streams in the DStream implementation.

To check determinism experimentally, we executed d-dedup many times with the larger test input (compressing a file of about 184MB), counting the number of items and the total item size through each stream. In d-dedup, each item in streams d2f[j], f2cp[i][j], cp2c[j], and c2s[j] consists of a fixed structure (40 bytes for the former two, 48 bytes for the latter two) and a variable-length string whose size is specified in the fixed part. Every run produced the same results. Figs.8(a) and 8(b) show the results of running each parallel stage with 8 threads, where j is the x-axis.

Fig.7. Deterministic dedup and ferret: schedules and code size comparisons. (a) Deterministic schedule for dedup. (b) Deterministic schedules for ferret. (c) Source lines of code (SLOC) comparisons. L: load, S: segment, E: extract, I: index, R: rand, O: output.

The SLOC comparisons of Fig.7(c):

  dedup,  DStream vs Pthreads: same 1865, modified 123, added 157, removed 360
  ferret, Unpack vs Pthreads:  same 8801, modified  83, added 182, removed 171
  ferret, Pack vs Pthreads:    same 8738, modified  63, added  98, removed 254
  ferret, Pack vs Unpack:      same 8865, modified  24, added  10, removed 177

Fig.8. Comparisons between two fix strategies, running each parallel stage with 8 threads. (a) Number of items through each stream. (b) Total item size through each stream. (c) Run time of each parallel-stage thread (x-fix1 means the x stage in dedup-f1 and x-fix2 means the x stage in dedup-f2). (d) Extension ability for each parallel stage: per stream kind (d2f, f2cp, cp2c, c2s) and fix strategy (eager, lazy), it tabulates the fix counts (cf), extension counts (ce), pages allocated (pa), pages needed (pn), size of the internal SPMC region (s), page use ratio (ru = pn/pa), and address space extension times (re).

6.2 Extension Ability for Unlimited Streams

The D stage in the original dedup uses a very large buffer (600MB) to hold a data chunk from the input file. Due to the limitation in our current implementation of the SPMC model (see Section 4), we use an 8MB stream for communication between each pair of D and F stages, and streams of the default size (512KB) for the other stage communication. When running each parallel stage with eight threads, the last column in Fig.8(d) lists the address space extension times re of each kind of stream, computed as (ce/2 + n)/n, where ce is the extension count and n is the number of streams of that kind. For the cp2c streams, the extension times reach 214.6 under the eager fix strategy, and the actual number of pages mapped to the 8 × 512KB of virtual addresses (pa) is 215 495 (about 841MB), which far exceeds 4MB.

6.3 Performance

We evaluate the performance of DStream versus Pthreads, Dthreads, and TBB on various versions of dedup and ferret using the large test input. Fig.9 presents the runtime overhead of the DStream, TBB, and Dthreads versions relative to their Pthreads counterparts. Since dedup-dthreads got stuck when running with four CPUs, we do not include its results. We use -f1 or -f2 in the benchmark names to denote the eager or lazy fix strategy of Section 4, respectively, and -p to indicate the packed DStream version of ferret.

From the figure, we see that Dthreads ensures determinism but incurs more than 2.6X the overhead of the others. All TBB and DStream versions of dedup and ferret have nearly the same or lower runtime than their Pthreads counterparts. For dedup, dedup-tbb performs better than DStream on 2, 4, and 8 CPUs, but dedup-f2 performs best on 16 and 32 CPUs: dedup-f2 is 2.56X and 7.05X faster than the Pthreads counterpart, and 1.06X and 3.9X faster than the TBB counterpart, on 16 and 32 CPUs respectively. For ferret, the packed DStream version performs better than the unpacked one, and both perform on par with their Pthreads and TBB counterparts in terms of runtime. ferret-tbb performs best on eight CPUs, but costs more time than the packed DStream versions beyond eight CPUs.

6.4 Scalability

Fig.10 shows each benchmark’s speedup relative to its own single-CPU execution. Pthreads has worse scalability on dedup than on ferret, because there is more contention overhead on accessing the hash table in dedup after eight CPUs. All benchmark versions level

70

J. Comput. Sci. & Technol., Jan. 2015, Vol.30, No.1

3.0

3.0

2.5

2.5

2.0

2.0

1.5

1.5

3.2

9.0

13.5

ferret-f1 ferret-f2 ferret-f1-p ferret-f2-p ferret-tbb ferret-dthreads

1.0

1.0

dedup-f1 0.5 dedup-f2 dedup-tbb 0

0.5 0 1

2

4

8

16

32

1

2

4

8

NCPU

NCPU

(a)

(b)

16

32

Fig.9. Runtime overhead relative to Pthreads on various benchmarks. 10

10

8

8

6

6

4

dedup-f1 dedup-f2 dedup-tbb deduppthreads

2 0 1

2

4

8

16

32

4 2 0 1

NCPU (a)

2

4

8 NCPU

16

32

ferret-f1 ferret-f2 ferret-f1-p ferret-f2-p ferret-tbb ferretdthreads ferretpthreads

(b)

Fig.10. Parallel speedup over its own single-CPU performance on various benchmarks.

off when there are less than or equal to four CPUs, because they all run each parallel stage in one thread. TBB scales well before eight CPUs on dedup and 16 CPUs on ferret, but does poorly afterwards. DStream benchmarks scale well with CPU counts as a whole, which can be attributed to the underlying SPMC memory mapping mechanisms in DStream. Thereinto, the synchronization between the producer and consumers only happens when a consumer is attempting to read data which has not been fixed by the producer, and in other cases, all producer and consumers can perform asynchronously at their own paces. 6.5

6.5 Comparison Between Two Fix Strategies

From Fig.9, we observe that dedup-f1 performs worse than dedup-f2. We further analyzed the detailed overheads of the two versions by running each parallel stage with eight threads. Fig.8(d) lists the number of pages allocated and needed for each kind of stream. Except for the d2f streams, the pa values in dedup-f1 are 4X∼9X those in dedup-f2, and the page use ratios in dedup-f1 are below 22.5% whereas those in dedup-f2 are nearly 100%, showing more page allocation overhead in dedup-f1 than in dedup-f2. Fig.8(c) shows the runtime of each parallel-stage thread in the two versions, confirming that each parallel-stage thread in dedup-f1 takes at least 2 s more than its counterpart in dedup-f2.

6.6 Load Balancing

We now discuss the load balancing of the deterministic scheduling for dedup in Fig.7(a), by running dedup-f1 and dedup-f2 with eight threads for each parallel stage. The output policies of stages D and F are the key factors influencing load balancing. From Figs.8(a) and 8(b), we observe that the items put on each d2f stream by the single D thread in round-robin fashion are balanced. The items put on each f2cp stream by the eight F threads according to the computed hash values are balanced in most cases, but there are peaks on each stream received by the third CP thread. This reflects that the test input has more items whose hash values fall in the sub hash table handled by the third CP thread. The detailed runtime of each thread shown in Fig.8(c) indicates that the lazy fix strategy can reduce the imbalance.

6.7 More Analysis by Using Linux Perf

To understand the underlying program behaviors in more detail, we use Linux Perf v3.2.14 to profile the different dedup versions, running them on 8, 16, and 32 CPUs respectively with perf stat -d -r 3. Table 2 lists part of the profiling data output by Perf. From the table, we observe that the two dstream versions take more page faults than the others, leading to higher overhead. This is because we implemented the SPMC memory entirely in user space for quick prototyping; we expect the page faults to decrease significantly once the SPMC memory is implemented in kernel space. Due to the eager fix strategy, dstream-f1 takes nearly 4X the page faults of dstream-f2. For LLC (last-level cache) loads, both dstream versions achieve cache hit ratios above 40%, much higher than pthreads and tbb, reflecting better locality. We also observe that both dstream versions have stable LLC loads and page faults on 8, 16, and 32 CPUs, meaning that the change of thread count has little effect on the number of memory accesses, which again reflects the better scalability of DStream. In contrast, the LLC loads, page faults, context switches, and CPU migrations of the tbb version increase significantly when running on 32 CPUs, leading to its poor performance there. We do not yet understand the relationship between the elapsed time and the absolute values of context switches, CPU migrations, and IPC (instructions per cycle), which is a subject for further study.

7 Related Work

Deterministic Multithreading Systems. There have been many deterministic multithreading (DMT) systems. We focus here on software-only, non-language-based approaches. Most of them, e.g., [9, 14-17], are backward-compatible and schedule nondeterministic inter-thread interactions on a repeatable, artificial time schedule. Determinator[10], however, offers a new programming model and OS redesigned to eliminate conventional data races and hide nondeterministic kernel state from applications. Like Determinator, DStream offers a new thread model with deterministic semantics, but supports "peer-to-peer" communication and synchronization, which Determinator does not. DStream isolates threads by running them in separate processes, as Determinator, Grace[15], and Dthreads[9] do, but augments the SPMC to allow deterministic read-write memory sharing among threads. Although DMT is commonly believed to simplify the testing, debugging, and record-replay of multithreaded programs, making DMT systems scale well remains a challenge[18]. The latest Parrot[17] improves performance by introducing performance hint abstractions and their runtime; even so, all three pipeline workloads from PARSEC[7], i.e., dedup, ferret, and x264, still incur 1.7X∼3.72X overhead over nondeterministic execution (see Fig.8 in [17]). DStream enables higher performance than these DMT systems and is close to, or faster than, Pthreads in runtime.

Programming Models for Pipelines. There have been several attempts at providing higher-level abstractions for expressing pipeline parallelism. Stream programming languages (e.g., StreamIt[13,19]) provide explicit syntax for data, task, and pipeline parallelism, but remain challenging to compile and optimize. Task-based libraries such as TBB, Cilk Plus for C/C++○4, and TPL Dataflow in .NET○5 offer attractive alternatives for legacy code. TBB includes specialized pipeline constructs for expressing restricted pipelines[8]. Cilk Plus can only express a serial-parallel-serial 3-stage pipeline by clever use of spawn-sync and reducer constructs[1]. TPL Dataflow includes specialized dataflow constructs for general producer/consumer relationships, but only supports the .NET platform. They all rely on programmers to avoid race conditions and other errors caused by nondeterminism. DStream currently does not contain higher-level constructs such as StreamIt's Pipeline, SplitJoin, and FeedbackLoop constructs for composing pipeline stages into a communicating network, but it offers a stream abstraction for automatic management and optimization of communication, and can greatly simplify the tasks of a streaming language compiler or scheduler.

Table 2. Some Profiling Data of Various dedup Versions Collected by Linux Perf

  NCPU  Version     Time (s)  IPC   Context Switches  CPU Migrations  Page-Faults  LLC Loads  LLC Hits (%)
  8     pthreads     9.671    1.28  129 937             105           1.15E+06     6.27E+07   19.67
  8     dstream-f1   8.976    0.95   12 516              29           5.88E+06     1.14E+08   49.63
  8     dstream-f2   6.632    1.22   11 534              28           1.61E+06     5.03E+07   41.90
  8     tbb          3.761    1.38   23 924              20           6.16E+06     3.23E+08   31.00
  16    pthreads    13.532    0.96  262 689            1 619          1.14E+06     1.07E+08   13.05
  16    dstream-f1   6.547    0.89    4 804              41           5.93E+06     1.15E+08   55.16
  16    dstream-f2   3.732    1.20    6 789              43           1.61E+06     4.95E+07   44.04
  16    tbb          4.776    1.14   65 396             701           6.88E+05     5.52E+07   18.16
  32    pthreads    20.968    0.75  369 932            1 240          1.13E+06     1.86E+08    8.38
  32    dstream-f1   8.032    0.42    9 153             146           6.03E+06     1.29E+08   54.40
  32    dstream-f2   2.547    1.19    4 188             223           1.65E+06     4.99E+07   45.76
  32    tbb         10.458    0.59  175 676            1 200          7.92E+05     1.35E+08   10.88

8 Conclusions

We presented DStream, a simple and practical programming model for deterministic pipeline parallelism. It offers a race-free multithreaded model and scalable deterministic streams for communication between pipeline stages. The case study of converting two pipeline applications and the experiments show that DStream brings determinism and is easy to use, fast, and scalable.

The current DStream requires the programmer to directly use threads to represent pipeline stages and to arrange deterministic schedules by configuring streams. Our future work will provide higher-level pipeline programming abstractions like those in stream languages, which would be mapped to the basic thread and stream abstractions proposed in this paper through compiler techniques and scheduling algorithms. We will also study dynamic but deterministic load balancing mechanisms for partitioning hash tables and other data containers.

Acknowledgement We thank Dao-Chen Liu, a master student, for testing pre-release versions. We also thank the anonymous reviewers for their constructive suggestions and great enthusiasm.

References

[1] McCool M, Reinders J, Robison A D. Structured Parallel Programming: Patterns for Efficient Computation. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2012.

[2] Artho C, Havelund K, Biere A. High-level data races. In Proc. the 1st International Workshop on Verification and Validation of Enterprise Information Systems, April 2003, pp.82-93.
[3] Lee E. The problem with threads. Computer, 2006, 39(5): 33-42.
[4] Lu S, Park S, Seo E, Zhou Y. Learning from mistakes: A comprehensive study on real world concurrency bug characteristics. In Proc. the 13th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), March 2008, pp.329-339.
[5] Zhang Y, Ford B. A virtual memory foundation for scalable deterministic parallelism. In Proc. the 2nd APSys, July 2011, pp.7:1-7:5.
[6] Zhang Y, Ford B. Lazy tree mapping: Generalizing and scaling deterministic parallelism. In Proc. the 4th Asia-Pacific Workshop on Systems (APSys), July 2013, pp.20:1-20:7.
[7] Bienia C, Kumar S, Singh J P et al. The PARSEC benchmark suite: Characterization and architectural implications. In Proc. the 17th PACT, October 2008, pp.72-81.
[8] Reed E C, Chen N, Johnson R E. Expressing pipeline parallelism using TBB constructs: A case study on what works and what doesn't. In Proc. SPLASH, October 2011, pp.133-138.
[9] Liu T, Curtsinger C, Berger E. Dthreads: Efficient deterministic multithreading. In Proc. the 23rd SOSP, October 2011, pp.327-336.
[10] Aviram A, Weng S C, Hu S, Ford B. Efficient system-enforced deterministic parallelism. In Proc. the 9th OSDI, October 2010, pp.193-206.
[11] Merrifield T, Eriksson J. Conversion: Multi-version concurrency control for main memory segments. In Proc. the 8th EuroSys, April 2013, pp.127-139.
[12] Aviram A, Ford B, Zhang Y. Workspace consistency: A programming model for shared memory parallelism. In Proc. the 2nd WoDet, March 2011.
[13] Thies W, Karczmarek M, Amarasinghe S. StreamIt: A language for streaming applications. In Proc. the 11th CC, April 2002, pp.179-196.
[14] Olszewski M, Ansel J, Amarasinghe S. Kendo: Efficient deterministic multithreading in software. In Proc. the 14th ASPLOS, March 2009, pp.97-108.
[15] Berger E D, Yang T, Liu T, Novark G. Grace: Safe multithreaded programming for C/C++. In Proc. the 24th OOPSLA, October 2009, pp.81-96.
[16] Bergan T, Anderson O, Devietti J, Ceze L, Grossman D. CoreDet: A compiler and runtime system for deterministic multithreaded execution. In Proc. the 15th ASPLOS, March 2010, pp.53-64.
[17] Cui H, Simsa J, Lin Y H et al. Parrot: A practical runtime for deterministic, stable, and reliable threads. In Proc. the 24th SOSP, November 2013, pp.388-405.
[18] Olszewski M, Ansel J, Amarasinghe S. Scaling deterministic multithreading. In Proc. the 2nd WoDet, March 2011.
[19] Gordon M I, Thies W, Amarasinghe S. Exploiting coarse-grained task, data, and pipeline parallelism in stream programs. In Proc. the 12th ASPLOS, October 2006, pp.151-162.

④ http://www.cilkplus.org/, Nov. 2014.
⑤ http://www.nuget.org/packages/Microsoft.Tpl.Dataflow, Nov. 2014.


Yu Zhang is an associate professor in the School of Computer Science and Technology at the University of Science and Technology of China, Hefei. Her research spans programming languages, runtime systems, and operating systems, with a particular focus on systems that transparently improve reliability, security, and performance. She is a member of CCF.

Zhao-Peng Li is a postdoctoral researcher in the School of Computer Science and Technology, University of Science and Technology of China, Hefei. He received his Ph.D. degree in computer science from the University of Science and Technology of China in 2008. His research interests include program verification, theorem-proving-based program analysis, and runtime systems. He is a member of CCF.


Hui-Fang Cao is a master student in the School of Computer Science and Technology at the University of Science and Technology of China, Hefei. Her research interests include operating systems and parallel programming.
