SPAX: A New Parallel Processing System for Commercial Applications

Woo-Jong Hahn, Kee-Wook Rim
Electronics and Telecommunications Research Institute, Processor Section
POB 106, Yusong, Taejon, Korea
[email protected]

Soo-Won Kim
Korea University, Department of Electronics Engineering
Anam-dong, Seongbuk-ku, Seoul, Korea

Abstract

In this paper we describe SPAX, a new parallel processing system for commercial applications. SPAX cost-effectively overcomes the limitations of the SMP by providing the scalability of a parallel processing system together with the application portability of an SMP. We also describe a new system network, called Xcent-Net, which interconnects hundreds of multiprocessor PC boards in SPAX. It is a hierarchical network that provides incremental scalability with minimal re-wiring when the user's requirements change. It is built from low-latency crossbar routers, which form router clouds and provide up to 2.67 Gbytes/sec of bandwidth per router cloud. We briefly describe a preliminary evaluation showing that Xcent-Net will not be the bottleneck when the system runs a typical commercial application.
1. Introduction

Parallel architectures continue to evolve to meet the increasing need for more processing power. Most parallel processing systems have focused on computation-intensive applications. However, there is increasing interest in applying parallel processing systems to commercial applications, as these applications require ever more processing power [12, 4]. For large commercial applications, including newly emerging multimedia applications, a GMSV (Global Memory Shared Variable) architecture, such as a bus-based SMP, is usually adopted for the application server. This architecture has a sharp limit on scalability. On the other hand, the DMSV (Distributed Memory Shared Variable) architecture still has technical issues to resolve concerning its long latency [3]. To provide both scalability and programmability, DMMP (Distributed Memory Message Passing) with SVM (Shared Virtual Memory) on top of it can be a more practical solution.
The NOW (Network of Workstations) architecture provides the DMMP model cost-effectively. It is less expensive and supports many commercial applications with minimal rework [10]. Moreover, NOW naturally provides incremental scalability, another important requirement of commercial application users, who tend to upgrade gradually from an inexpensive entry system. As relatively inexpensive PCs become competitive with more proprietary workstations in terms of performance, it becomes more attractive to incorporate PC boards rather than workstations. On this basis, a cluster architecture built from PC boards and a high-bandwidth interconnection network is one of the most promising architectures. A central issue in building a scalable parallel machine, whether a PC cluster or an MPP, is the design of the interconnection network [6]. Most interconnection networks developed for current parallel processing systems have a regular and uniform structure over the whole system [5, 2]. Their performance relies heavily on the data parallelism of scientific applications. However, most commercial applications, such as OLTP, do not have much data parallelism. Hence, PC or workstation clusters mostly rely on LANs. Among LANs, Ethernet is the most popular and widely used. Current Ethernet, even the 100Base variants, does not provide enough bandwidth for clustering nodes that work as a single system. It fits the communication between independent systems well, such as transferring a file to another system once in a while, but its long latency makes it inefficient for frequent communication. ATM as a LAN is emerging and provides higher bandwidth than conventional Ethernet, but it has inherent deficiencies as a system network, since it was originally developed for long-haul networks. Even though many efforts are resolving problems such as supporting broadcast on a point-to-point protocol and controlling bursty data flow, its additional LAN layer introduces additional latency. ATM becomes a very strong candidate as a LAN when it is linked to a wide-area ATM network; the wide-area ATM network, however, is yet to come, which diminishes the
benefit of ATM as a LAN. We designed a new interconnection network, called Xcent-Net, for a cluster-based parallel processing system. In this paper we explore the requirements, specification, and characteristics of the network, and we also describe its implementation. Using event-driven simulation with a TPC benchmark model, we show that it will not be a bottleneck in a system running a typical commercial application. We introduce the new parallel processing system based on Xcent-Net, called SPAX (Scalable Parallel Architecture based on Xcent-Net). It aims to be a cost-effective parallel processing system for commercial applications. To achieve this goal, it utilizes the existing design of a multiprocessor PC board, and we have developed a proprietary network interface to interconnect such boards through Xcent-Net. The system is scalable up to hundreds of processors, well beyond today's SMPs, while maintaining incremental scalability [12]. Following this introduction, section 2 describes the architecture and design of SPAX, while section 3 describes Xcent-Net. Section 4 presents a preliminary performance evaluation of the network traffic. Finally, section 5 concludes and reports the status of the work.
2. SPAX

We have been developing SPAX as a nation-wide information network server in Korea. As its primary target application is OLTP, we summarize the basic requirements of SPAX below.
- Hybrid architecture that can exploit the advantages of both NOW and MPP
- Portability and adaptability of commercial applications, such as a parallel DBMS and OLTP applications, running on an SMP
- Higher bandwidth and lower latency than conventional LANs for intra-system communication
- Incremental scalability from a few processors
- Modularity and flexibility in system configuration
- Practical packaging that is compact and air-cooled
- High availability
The result is a hierarchically clustered architecture that provides the characteristics of SMP and MPP at different levels. The hierarchical architecture is incrementally scalable in nature, and the clustered architecture gives proper modularity and flexibility in configuring the system. As the system can be partitioned into several clusters, we can provide a separate, air-cooled cabinet that holds one or two clusters with disks. To achieve high bandwidth and low latency, it is necessary, but not sufficient, to design a proper interconnection network; the system must also provide an efficient way to access the network without much overhead. The SPAX operating system therefore provides a special driver to access the network. Figure 1 shows the proposed architecture of the parallel processing system for commercial applications.
Figure 1. Architecture of SPAX (Xcent router clouds form the hierarchical Xcent-Net interconnecting Cluster_0 through Cluster_15; each cluster holds processor/memory (P/M) nodes and IO (IOC) nodes with Ethernet, FDDI, X.25, and ISDN connections)
A maximum configuration consists of 16 clusters, and each cluster has up to 8 nodes. Each node can be a quad-processor user node or a single-processor IO node. In a normal configuration, the system supports up to 256 user processors and 64 IO processors with 64 user nodes and 64 IO nodes. Since the network, called Xcent-Net, does not distinguish node types, the user can select various configurations by substituting a user node for an IO node or vice versa. The minimum system may contain only one user node and one IO node, and the user can add nodes one by one to obtain the required processing power or storage capacity. All hierarchy levels use the same network, so we do not need the protocol conversion that most SMP clusters require. This gives efficient, low-latency communication across the whole system. The system runs a micro-kernel based UNIX that is being developed at ETRI. Each node has the same copy of the micro-kernel, but may run different servers on top of it. The user nodes have processing servers such as the process manager, the interprocess communication manager, and the coherency manager. Each user node does not necessarily have all of the processing servers; they can be distributed over the user nodes and reassigned dynamically. IO nodes have IO servers, which include, among others, the block-IO manager, the disk cache manager, and the configuration manager.

SPAX consists of many identical components and is very modular. The system is designed to eliminate potential single points of failure such as the loss of a processor, a network, or a disk drive. Xcent-Net provides dual paths to every node through the duplicated hierarchical crossbar network. To make the system more cost-effective, both paths of Xcent-Net are active in the fault-free case. This embedded redundancy provides reasonable fault tolerance with minimal performance degradation [1].
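For concreteness, the configuration arithmetic above can be summarized in a small sketch. The structure and limits below are illustrative only; they are not part of the SPAX software.

#include <stdio.h>

/* Illustrative limits taken from the text: 16 clusters x 8 nodes = 128 nodes;
 * a user node carries 4 processors, an IO node carries 1. */
#define MAX_CLUSTERS      16
#define NODES_PER_CLUSTER  8

struct spax_config {
    int clusters;      /* 1 .. MAX_CLUSTERS                          */
    int user_nodes;    /* any mix of user and IO nodes is allowed    */
    int io_nodes;
};

static int total_processors(const struct spax_config *c)
{
    return c->user_nodes * 4 + c->io_nodes * 1;
}

int main(void)
{
    /* The "normal" maximum cited in the paper: 64 user nodes + 64 IO nodes. */
    struct spax_config cfg = { MAX_CLUSTERS, 64, 64 };
    printf("nodes=%d processors=%d\n",
           cfg.user_nodes + cfg.io_nodes,
           total_processors(&cfg));   /* 128 nodes, 320 processors (256 user + 64 IO) */
    return 0;
}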
3. Xcent-Net

3.1. Requirements

A central issue in designing SPAX is the interconnection network. The requirements for Xcent-Net, the interconnection network of SPAX, are listed below.
- Reasonable bandwidth allocation between the inter-cluster communication channels and the intra-cluster (inter-node) channels
- As little sensitivity as possible to the application software
- A simple and globally uniform protocol
- Ease of packaging
- Network partitionability
- High availability

Considering bandwidth allocation, we do not expect as much communication on the inter-cluster channels as on the intra-cluster channels, since each cluster has enough processing power and storage capacity to handle a fairly large user job. Moreover, not many commercial applications have enough data parallelism to utilize the whole network at a given time. On this basis, we allocate four times more bandwidth to the intra-cluster network than to the inter-cluster network, which still provides higher bandwidth than ATM or even Fibre Channel. Application performance often differs with the topology. For example, HiPi+Bus [11] shows little preference for a specific application, but a ring topology does not behave this way, since its latency differs greatly with the locations of the requester and the responder. While it is hard to provide uniform latency over the whole system, we need to make the intra-cluster network as uniform as possible. One of our major design goals is to provide a simple and seamless global protocol for both the inter-cluster and the intra-cluster network. Otherwise we would suffer protocol conversion overhead, which increases latency significantly regardless of the peak bandwidth. Moreover, as the protocol becomes more complex, it becomes harder to test in a real implementation environment. Packaging constraints are another important issue, even though they have been overlooked by most academic researchers. They include board layout, connectors, pin assignment, and wiring. With a given ASIC technology, the number of pins and pads is limited, so there is a trade-off between topology, or the latency of the intra-cluster network, and channel width. Even after that trade-off is made, it is important to assign signals to the pins of the ASIC and the connectors carefully, as the assignment significantly affects signal integrity and noise; this again trades off operational speed against channel width. Also, considering maintenance and incremental scalability, re-wiring must be easy. Unlike scientific applications, most commercial applications consist of many small jobs, each with its own IO operations. To partition the network efficiently, we need to distribute the IO nodes over the entire network. Commercial applications also require a reliable environment, so we need to remove any single point of failure in the network.

3.2. Xcent-Net Design

Considering the design factors described above, we chose a crossbar network for the intra-cluster network. It becomes a dual hierarchical crossbar network, since we use the same network for the inter-cluster network to avoid protocol conversion overhead. Its design specifications are summarized in Table 1.
Table 1. Xcent-Net Specification

Topology: Dual hierarchical crossbar network
Router: 10 x 10 crossbar router (8:2)
Nodes: 128 nodes (16 clusters)
Channel Width: 32 bits per channel
Operation Freq.: 33.3 MHz (long), 66.6 MHz (short)
Bandwidth (agg.): 34.1 Gbytes/sec at 33.3 MHz
Maximum Distance: 33 feet between routers
Fault Tolerance: Embedded redundancy, node isolation
Routing: Virtual cut-through routing, adaptive path control
Clock Scheme: Plesiochronous clocking, source synchronous transmission
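As a cross-check (our own reading of the numbers; the paper does not spell out this arithmetic), the bandwidth figures quoted in the abstract, Table 1, and section 5 are consistent with the stated channel width and clock rates:

  per channel at 66.6 MHz:      4 bytes x 66.6 MHz          = 267 Mbytes/sec
  per router cloud (10 ports):  10 x 267 Mbytes/sec         = 2.67 Gbytes/sec
  aggregate at 33.3 MHz:        4 bytes x 33.3 MHz x 256 node channels
                                (128 nodes, dual paths)     = 34.1 Gbytes/sec

The last line assumes that the aggregate figure counts both redundant paths to all 128 nodes.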
A detailed description can be found in [7]; we summarize it below.

3.2.1. Message

We define two types of messages: control messages and data messages. The control message size ranges from 4 bytes to 64 bytes, and control messages are used to transfer resource or process control information between nodes. The data message size is typically the same as the page size but can be up to 1 Mbyte. Data messages can increase the traffic on the network, since each consists of tens of packets or more. According to our preliminary evaluation, Xcent-Net does not become a bottleneck for the typical page size [12]. To support the two message types efficiently, we provide both DMA-based and memory-mapped transfer methods. We reserve memory space for control messages, which are stored to and retrieved from dedicated buffer memory in the network interface. Once a control message is stored in the dedicated memory, the network interface automatically sends it to the destination. As it is not efficient for the processor to move large data messages to or from the dedicated memory, we provide a special DMA engine that partitions a large data message in shared memory into packets and moves them to the network interface, or vice versa.

3.2.2. Packet

Xcent-Net is a packet-switched network, and the packet is the basic unit of communication. A packet consists of at least 3 flits and at most 21 flits, including 16 information flits. The maximum payload of a packet is determined by compromising among the allowable clock skew, the latency involved in synchronizing a packet, the maximum operating speed, and the distance between routers. Restricting the packet length also improves fairness in using the channels. Each packet has one or more tags and one control flit, which contains control information such as a packet sequence number used by the network interface.
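To make the two transfer paths concrete, the sketch below shows how a node driver might hand a message to the network interface. All structure layouts and function names are hypothetical; they are not the actual SPAX driver interface.

#include <stdint.h>
#include <string.h>

#define CTRL_MSG_MAX  64u          /* control messages: 4..64 bytes (section 3.2.1) */
#define DATA_MSG_MAX  (1u << 20)   /* data messages: up to 1 Mbyte                  */

/* Hypothetical descriptor the DMA engine walks to split a large data message
 * held in shared memory into network packets. */
struct dma_desc {
    uint64_t src_addr;    /* physical address of the data message */
    uint32_t length;      /* total length in bytes                */
    uint16_t dest_node;   /* destination node id                  */
    uint16_t flags;
};

/* Control path: the processor copies the short message directly into the
 * dedicated buffer memory on the network interface (memory-mapped); the
 * interface then sends it to the destination on its own. */
static int send_control_msg(uint8_t *nif_ctrl_buf, const void *msg, uint32_t len)
{
    if (len > CTRL_MSG_MAX)
        return -1;                      /* too large for the control path  */
    memcpy(nif_ctrl_buf, msg, len);     /* store into the dedicated buffer */
    return 0;                           /* a doorbell write would follow   */
}

/* Data path: the processor only posts a descriptor; the special DMA engine in
 * the network interface partitions the message into packets and moves them. */
static int send_data_msg(struct dma_desc *nif_dma_slot, uint64_t src,
                         uint32_t len, uint16_t dest)
{
    if (len > DATA_MSG_MAX)
        return -1;
    nif_dma_slot->src_addr  = src;
    nif_dma_slot->length    = len;
    nif_dma_slot->dest_node = dest;
    nif_dma_slot->flags     = 0;
    return 0;
}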
3.2.3. Routing

Xcent-Net uses virtual cut-through routing to shorten the network latency and to minimize blocking of other messages while one is passing through the network. This choice is a compromise between network efficiency and latency in the given physical environment, with neighboring routers up to 33 feet apart. When the distance between routers is not short enough, the propagation time of the handshaking signals cannot be ignored, and the resulting increase in total latency is not tolerable. Xcent-Net uses source-based routing, in which the source node attaches tag flits to each packet to designate the destination. This is flexible, as the routing policy can be changed by modifying the route determination algorithm in the sender. To increase network efficiency, the channels to the upper hierarchy incorporate adaptive routing. As a result, Xcent-Net implements source-based adaptive routing. Each packet can carry multiple tags, depending on how far away the destination is. Each router uses the appropriate tag and drops it when the packet is transferred to the next router, so the number of tags is determined by the number of router stages the packet is to pass through. Figure 2 shows this use-and-drop scheme for the source-based routing.

Figure 2. Use-and-Drop scheme in Xcent-Net (a packet travels from Sender to Receiver across internode and intercluster router stages; at each stage the leading Tag flit is consumed while the Ctrl and Data flits pass through unchanged)
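The use-and-drop tag scheme can be illustrated with a small sketch. The paper does not give the tag encoding used by Xcent-Net, so the field layout and the routines below are assumptions; the point is only that the sender precomputes one tag per router stage and each router consumes the leading tag.

#include <stdint.h>

#define MAX_TAGS 4   /* Figure 2 shows up to four router stages end to end */

/* A precomputed route: one tag (output-port number) per router stage.
 * A plain port number is assumed here purely for illustration. */
struct route {
    int     ntags;
    uint8_t tag[MAX_TAGS];
};

/* Sender side: for an intra-cluster destination a single tag (the node's port
 * on the local router cloud) suffices; for an inter-cluster destination the
 * sender prepends tags for the up-channel (port 8 or 9, chosen adaptively),
 * the upper-hierarchy stage(s), and the node port in the destination cluster. */
static struct route build_route(int same_cluster, uint8_t up_port,
                                uint8_t level0_port, uint8_t node_port)
{
    struct route r = { 0, { 0 } };
    if (same_cluster) {
        r.tag[r.ntags++] = node_port;
    } else {
        r.tag[r.ntags++] = up_port;       /* 8 or 9: adaptive choice of up-channel */
        r.tag[r.ntags++] = level0_port;   /* across the inter-cluster cloud        */
        r.tag[r.ntags++] = node_port;     /* down to the destination node          */
    }
    return r;
}

/* Router side: use the leading tag to pick an output port, then drop it so the
 * next router stage sees its own tag first ("use-and-drop"). */
static uint8_t use_and_drop(struct route *r)
{
    uint8_t port = r->tag[0];
    for (int i = 1; i < r->ntags; i++)
        r->tag[i - 1] = r->tag[i];
    r->ntags--;
    return port;
}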
3.2.4. Flow control

Xcent-Net uses clear-to-send flow control: a router or node confirms the availability of the destination's packet buffer before it sends a packet. When the destination is available, the incoming packet flows through the router toward the destination even before the entire packet has arrived. We adopt source synchronous transmission to transfer the flits to the adjacent router. This avoids the skew problem of global clocking and allows each router cloud to operate on its own clock.
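A minimal model of the clear-to-send flow control combined with cut-through forwarding might look as follows; the buffer size and flit type are taken from section 3.2.2, while everything else is an illustrative assumption.

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t flit_t;            /* one 32-bit flit per channel cycle */

struct port {
    bool   ready;                   /* downstream asserts Ready when its packet buffer is free */
    flit_t buffer[21];              /* one packet: 3..21 flits (section 3.2.2) */
    int    count;
};

/* Clear-to-send: confirm the downstream buffer is available before sending.
 * With virtual cut-through, flits are forwarded toward the output as soon as
 * the output is available, without waiting for the whole packet to arrive. */
static bool try_forward_flit(struct port *out, flit_t f)
{
    if (!out->ready)
        return false;               /* hold the packet in the input buffer      */
    out->buffer[out->count++] = f;  /* flit flows through before packet is full */
    if (out->count == 21)
        out->ready = false;         /* buffer full until the next hop drains it */
    return true;
}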
Figure 3. Xcent-Net (router clouds at level 0 and level 1 interconnecting Cluster 0 through Cluster 15 and nodes N0 through N127)

Figure 3 shows the overall Xcent-Net architecture. A router cloud consists of four 8-bit, 10 x 10 routers, called Xcent, to provide a 32-bit path. Each cluster has dual router clouds for high availability. Eight channels of a cloud serve the nodes in the cluster, and the remaining two channels build up the inter-cluster network path. The Xcent-Net interface we developed has a PCI interface to communicate with the mother board. The PCI interface allows Xcent-Net to be adopted as a high-performance interconnection network for virtually any system or board that has a PCI interface. The eight routers forming the dual router clouds are installed on a single backplane, which, together with the nodes, builds up one cluster.
We built a simulation model that mimics a node in order to verify the functions and timing of the router cloud. The backplane is an impedance-controlled, 16-layer PCB designed to minimize signal distortion. We analyzed and controlled the skew among the backplane signals of the same port to be less than 0.3 ns. We selected the drivers for the backplane and the cables after careful analysis and SPICE simulation [9]. On the backplane, Xcent drives all signals directly using 4 mA TTL drivers, and discrete LVDS drivers are used for the cables.
3.3. Implementation

Most of the Xcent-Net functions are implemented in the Xcent router. Xcent consists of separate input and output ports, arbitration logic, a 10 x 10 crossbar, buffer memories, and global control logic. Figure 4 shows the block diagram of Xcent. The input port synchronizes the incoming packet with the local clock: the incoming packet is sampled with the incoming Sync signal and synchronized with the local clock in the packet buffer. An output packet buffer usually shows better efficiency than an input buffer, but we adopt the input buffer scheme for Xcent since it naturally synchronizes the incoming packet, which otherwise must be done with dedicated additional latches.

Figure 4. Block Diagram of Xcent (per-port input control with synchronizer, local buffer, and enqueue/dequeue and flow control; a 10 x 10 crossbar and data path controller; per-port output control with arbiter and synchronizer; clock/reset master, global arbiter, and miscellaneous logic; port signals Data, Valid, Sync, Svalid, Sready, and Ready for ports #0 through #9)

Ready and Valid are flow control signals. Xcent drives Ready when its input buffer is empty, and it sends a packet with Valid asserted, which indicates that the packet on the Data lines is valid. There are two types of arbiters: local arbiters and a global arbiter. Each output port has its own local arbiter, which selects which packet flows through that output port. The global arbiter performs arbitration for broadcasting, which needs all output ports. We used the Verilog hardware description language to capture the Xcent design and synthesized it for a 0.6 um gate array.
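The paper does not say which arbitration policy the local arbiters use; a round-robin arbiter is a common choice and is sketched below purely to illustrate the per-output-port arbitration described above.

#include <stdint.h>

#define NPORTS 10   /* Xcent is a 10 x 10 crossbar */

/* Per-output-port local arbiter.  'requests' is a bitmask of input ports that
 * have a packet destined for this output.  Round-robin is an assumption; the
 * actual Xcent policy is not described in the paper. */
struct local_arbiter {
    uint16_t last_grant;   /* input port granted most recently */
};

static int arbitrate(struct local_arbiter *arb, uint16_t requests)
{
    if (requests == 0)
        return -1;                               /* no packet wants this output */
    for (int i = 1; i <= NPORTS; i++) {
        int candidate = (arb->last_grant + i) % NPORTS;
        if (requests & (1u << candidate)) {
            arb->last_grant = (uint16_t)candidate;
            return candidate;                    /* this input may use the crossbar output */
        }
    }
    return -1;                                   /* unreachable when requests != 0 */
}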
4. Evaluation

We have performed a preliminary simulation of Xcent-Net and the system to locate the most probable bottleneck in the system. The simulation is driven by a synthetic transaction workload [8]. We use TPC-B, an IO-intensive workload, as the workload model. Rather than executing the real code of each transaction, the messages and data transfers required by each transaction are modeled. Since this simulation focuses on evaluating the interconnection network that connects PC boards through the PCI interface, the simulation model includes the details of the network interface of the node. Other components of the system, such as the processors, memory systems, disk operations, and PCI operations, are parameterized in terms of clock cycles. There are three essential constituents in our simulator: processes, shared resources, and events. The processes are the active entities that actually execute the functions defined in the simulator, in other words the workloads. A process may invoke subprocesses when necessary. The simulator controls the creation and termination of processes. A number of processes compete with each other for shared resources such as the packet buffers in the network interface. Events are used to synchronize processes.
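The simulator structure described above (processes, shared resources, events) can be sketched as a small discrete-event loop. Everything below, including the names, the contended resource, and the timing constants, is illustrative only and is not the actual ETRI simulator.

#include <stdio.h>
#include <stdlib.h>

/* A shared resource with a fixed number of units, e.g. packet buffers in the
 * network interface that transaction processes compete for. */
struct resource { int free_units; };

/* A pending event wakes a process at a given simulated time. */
struct event { double time; int process_id; struct event *next; };

static struct event *queue;                /* time-ordered event list */
static double now;
static int holding[8];                     /* which processes currently hold a buffer */

static void schedule(double when, int pid)
{
    struct event *e = malloc(sizeof *e), **p = &queue;
    e->time = when; e->process_id = pid;
    while (*p && (*p)->time <= when) p = &(*p)->next;   /* keep the list sorted */
    e->next = *p; *p = e;
}

/* One step of a transaction process: grab a packet buffer if available,
 * otherwise retry later.  A real process would model the TPC-B messages,
 * data transfers, and disk activity in detail. */
static void run_process(int pid, struct resource *bufs)
{
    if (holding[pid]) {                    /* finished with the buffer: release it */
        holding[pid] = 0;
        bufs->free_units++;
        schedule(now + 2.0, pid);          /* think time before the next request   */
    } else if (bufs->free_units > 0) {
        holding[pid] = 1;
        bufs->free_units--;
        schedule(now + 5.0, pid);          /* buffer held for 5 time units          */
    } else {
        schedule(now + 1.0, pid);          /* blocked: poll again later             */
    }
}

int main(void)
{
    struct resource bufs = { 4 };
    for (int pid = 0; pid < 8; pid++) schedule(0.0, pid);
    while (queue) {
        struct event *e = queue; queue = e->next; now = e->time;
        if (now < 100.0)                   /* stop injecting new work after t=100 */
            run_process(e->process_id, &bufs);
        free(e);
    }
    printf("finished at t=%.1f, free buffers=%d\n", now, bufs.free_units);
    return 0;
}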
Table 2. Constant parameters

Number of clusters: 8
Transactions per Teller: 100
Packet size (bytes): 64

Table 3. Buffer Utilization

IO Prob  Disk cache  Avg. data buf.  Level1 Xcent     Level1 Xcent     Level1 Xcent     Level2 Xcent
         miss rate   Q length        buf. util. (PN)  buf. util. (ION) buf. util. (ALL) buf. util. (ALL)
0.3      0.5           8.177         0.006            0.267            0.118            0.005
0.3      0.75          9.202         0.006            0.262            0.116            0.005
0.3      0.9           9.858         0.006            0.255            0.113            0.005
0.5      0.5          98.437         0.008            0.607            0.265            0.005
0.5      0.75        128.267         0.008            0.544            0.237            0.005
0.5      0.9         131.064         0.007            0.518            0.226            0.005

Table 2 lists the constant parameters of the simulation, and Table 3 shows the buffer utilization for the Xcent routers and the nodes. It appears from Table 3 that the Xcent buffer utilization is not high enough to become a performance limit at the various disk cache miss ratios. On the other hand, the data buffer on the IO node (ION) can become a bottleneck when there are enough disks per node. The utilization of the Xcent buffer for the IO node is significantly higher than that of the processing node (PN). This is because the processing node cannot generate many requests to the network while the IO node holds return messages in its buffer. Overall, Xcent-Net will not be the bottleneck even when there are enough disks per IO node.
5. Summary

We have designed and implemented a new parallel processing system based on a dual hierarchical crossbar network. We analyzed the requirements and design alternatives in designing a parallel processing system for commercial applications. The system supports up to 16 clusters, or 128 nodes, based on multiprocessor PC boards. Each node can have up to four Pentium Pro processors and two PCI interfaces. IO nodes can be distributed and mixed with processing nodes in any combination, which makes the system configuration flexible enough to meet various user requirements. We avoid a single point of failure in the processors, the interconnection network, and the disks. The interconnection network of the system, Xcent-Net, was also presented. It is a virtual cut-through network and provides high bandwidth and availability, delivering 267 Mbytes/sec of bandwidth per channel. The router of Xcent-Net is a byte-wide 10 x 10 crossbar router. Four routers build up a 32-bit intra-cluster network for a cluster, and the same router cloud is used for the inter-cluster network. Our preliminary evaluation shows that the network will not be the bottleneck for commercial applications such as OLTP. The system and network run with the micro-kernel based UNIX, which was also developed in this project. A DBMS engine will be ported to verify the system. After the verification, we will measure the dynamic behavior and performance of the system. It is to be used as a nation-wide information network server.
Acknowledgement

This work was performed as a part of the 'Highly Parallel Computer Development Project', funded by MIC and four major companies in Korea. The authors would like to thank the researchers involved in this project, especially the members of the Processor Section for their work on Xcent-Net and the nodes, and the members of USC for their work on evaluating SPAX.
References

[1] B. J. Min, S. H. Shin, and K. W. Rim. Design and analysis of a multiprocessor system with extended fault tolerance. In Proc. of the IEEE 5th Workshop on Future Trends in Distributed Computing Systems, pages 301-307, August 1995.
[2] C. B. Stunkel et al. Architecture and implementation of Vulcan. In Proc. of the 8th IPPS, pages 268-274, April 1994.
[3] E. E. Johnson. Completing an MIMD multiprocessor taxonomy. ACM SIGARCH Computer Architecture News, 16(3):44-47, June 1988.
[4] Gartner Group. Second annual advanced technology groups conference, 1992.
[5] Intel Co. Paragon XP/S product overview, 1991.
[6] J. Kim. Performance study of packet switching multistage interconnection networks. ETRI Journal, 16(3):27-41, October 1994.
[7] K. Park, J. S. Hahn, S. W. Sim, and W. J. Hahn. Xcent-Net: A hierarchical crossbar network for a cluster based parallel architecture. In Proc. of the 8th PDCS, October 1996.
[8] R. Marek and E. Rahm. Performance evaluation of parallel transaction processing in shared nothing database systems. In Proc. of PARLE, pages 295-310, May 1992.
[9] S. W. Sim and A. Martinez. Switching router network in a parallel computer. To appear in Proc. of Design SuperCon'97, January 1997.
[10] T. E. Anderson, D. E. Culler, and D. A. Patterson. A case for NOW. IEEE Micro, 15(1):54-64, February 1995.
[11] W. J. Hahn, K. W. Rim, and S. W. Kim. A multiprocessor server with a new highly pipelined bus. In Proc. of the 10th IPPS, pages 512-517, April 1996.
[12] W. J. Hahn, S. H. Yoon, and K. W. Rim. Design of the new parallel processing architecture for commercial applications. Journal of the Korean Institute of Telematics and Electronics, 31(5):897-907, May 1996.