Reconfigurable Crossbar Switch Architecture for Network Processors

7 downloads 19407 Views 335KB Size Report
multiprocessors and computer clusters. The results ... crossbar switch in a network processor it could be achieved. Thus ... Some companies and respective.
In: IEEE International Symposium on Circuits and Systems, ISCAS 2006, Island of Kos, Greece, pp.4042-4045, May 21 - 24, 2006

Reconfigurable Crossbar Switch Architecture for Network Processors Henrique C. Freitas, Milene B. Carvalho, Alexandre M. Amaral, Amanda R. M. Diniz, Carlos A. P. S. Martins

Luiz E. S. Ramos Computer Science Rutgers University New Jersey, USA [email protected]

Graduate Program in Electrical Engineering Pontifical Catholic University of Minas Gerais Belo Horizonte, Brazil [email protected], [email protected], [email protected], [email protected], [email protected] Abstract—This paper presents the proposal and development of a reconfigurable crossbar switch (RCS) architecture for network processors. Its main purpose is to increase the performance, and flexibility for environments with multiprocessors and computer clusters. The results include VHDL simulation of RCS and the use of it in a broadcast function implementation, found in message passing support middleware.

I.

INTRODUCTION

Until the end of nineties, network equipments normally used general-purpose processors. However, the need of quality-of-service (QoS) and the high speed of data transmission demanded a rapid evolution of network equipments. Thus, the Network Processor (NP) [1] [3] was created. Network processors took the place of some GPPs (General-Purpose Processors) and ASICs (Application Specific Integrated Circuits) in network equipments, targeting two important issues: flexibility and performance. As these features are essential to process the packets, a network processor is the best choice to get them. The main motivation is the necessity of increasing the two features cited before. With the use of a reconfigurable crossbar switch in a network processor it could be achieved. Thus, using network processor with a reconfigurable crossbar switch as interconnection structures, it is possible to increase the throughput and reduce the latency in communications with shared memory and message transference. Therefore, the main objective of this paper is to present the RCS-2, a reconfigurable crossbar switch architecture used to connect different inputs and outputs in interconnection and communication networks, and to implement dynamically topologies in two reconfiguration levels, described in section III.

The reconfigurable crossbar switch was described in VHDL (VHSIC Hardware Description Language) and implemented in FPGA (Field Programmable Gate Array). The results show the behavior of the application which contains communicating processes that perform a collective broadcast operation on the reconfigurable crossbar switch. II.

RELATED WORKS

There are lots of commercial network processors of different companies. Some companies and respective network processors [1] are: IBM (NP4GS3), Motorola/CPort (C-5 Family), Lucent/Agere (FPP/RSP/ASI), Sitera/Vitesse (IQ2000), Chameleon (CS2000), EZChip (NP1), Intel (IXP1200) and others. None of them presents reconfigurability, except the CS2000 of Chameleon [6]. However, it does not have a reconfigurable crossbar switch. NP architectures have dedicated blocks to execute specific functions as an embedded ASIC (NPSoC – Network Processor System-on-Chip). Some blocks are: PCI units, memory units, packet classifiers, policy engines, metering engines, packet transform engines, pattern processing engine, queue engine, QoS engines and other blocks. There are some documents and papers about crossbar switch, but nothing using reconfigurable crossbar in a network processor. The related works [2] [5] [10] present results of implemented crossbar switch on FPGA. The proposals present a new platform (FPGA) for implementing with reduced costs, flexibility, but they did not change the architecture core. The [2] work proposes a structure with an additional bus, but the reconfiguration is depended on the FPGA. The Flexbar [7] work proposes to modify the scheduler and network hardware levels, but the crossbar architecture core is similar to a traditional crossbar switch (TCS). The paper does not relate FPGA and the reconfigurable computing [8] as a feature of Flexbar. Nothing like the second level of reconfiguration, presented in the section III, appears in other works.

In: IEEE International Symposium on Circuits and Systems, ISCAS 2006, Island of Kos, Greece, pp.4042-4045, May 21 - 24, 2006

III.

RECONFIGURABLE CROSSBAR SWITCH ARCHITECTURE PROPOSAL

The figure 1 presents the R2NP (Reconfigurable RISC Network Processor) architecture [4]. The R2NP has been used as a base for the design of the reconfigurable crossbar switch architecture. Thus, the design of RCS-2 (Reconfigurable Crossbar Switch – 2 bits) was based on the use of it in a network processor.

Figure 1. R2NP Architecture

RCS-2, presented in figure 2, has three main blocks: (1) connection matrix, where the topologies are implemented; (2) decoder, that converts the reconfigurable bits for a matrix bits set and (3) pre-header analyzer (PHA). NPs can add a pre-header in the packet with the output destination.

reconfigurable crossbar switch (RCS-2) uses reconfiguration bits to implement the topology in the space. That topology actually maintains the created connections as a circuit. The reconfiguration bits set is capable of reconfiguring or implement a new topology in RCS-2 whenever necessary. RCS-2 architecture is based on two reconfiguration levels. Using these two levels it is possible to reconfigure and to readapt the crossbar switch to many network topologies and different workload situations. The first level is based on static reconfiguration using a reconfigurable device, like FPGA. Programming this device, it is possible to implement a RCS-2 with number of in and out ports (and consequently rows and columns – circuit and logic gates) limited by the device capacity. The second level of reconfiguration makes possible the implementation of different network topologies. It could be done by dynamically reconfiguration of the connection matrix nodes. These nodes determine which connections will be closed and consequently which paths exist through the crossbar switch. RCS-2 has two bits of reconfiguration to each node, which define the current topology. Only the Reconfiguration Unit and the instruction set of the network processor are able to change those bits in order to implement new topologies. Although one instruction can modify a reconfigurable bit, it only modifies the 01 and 10 formats (figures 2 and 3). The 00 and 11 formats are restricted to Reconfiguration Unit. This avoids undesired closing/opening operations on the connections. So, adding a second bit to represent connections between nodes permits the increasing of security, flexibility and performance. Figure 3 presents a Petri Net model of concurrent control for the reconfiguration of RCS-2. A connection can receive four types of decoded bits (00, 01, 10 and 11), but only 10 and 11 can close the connection to establish the communication (transition T3 is enabled). The R2NP ISA is capable to alter the implemented topology only if the register node receives from the reconfiguration unit the 01 and 10 decoded bits. In this example, the transition T2 cannot start because the place Node 34 does not have the tokens 01 and 10, but 11, and only the reconfiguration unit is capable to alter (T1 can start). The transition T3 starts if the Node 34 and Input 3 receive respectively 10 or 11 and I3 tokens.

Figure 2. Reconfigurable crossbar switch architecture

The reconfigurable crossbar switch (figure 2) has some connection nodes, which, if closed, compose a circuit. This circuit represents a topology in space. Differently from a traditional crossbar switch (TCS), where it is possible to close only one node per line or column, regards the implemented topology, the RCS-2 permits that more than one node can be closed per line or column at the same time. In a TCS, the topologies cannot be implemented in the space, only in the time. During a slice of time the same nodes must be opened and closed repeatedly to create the illusion of an implemented topology. On the other hand, the

Figure 3. Petri Net model of RCS-2 concurrent control

In: IEEE International Symposium on Circuits and Systems, ISCAS 2006, Island of Kos, Greece, pp.4042-4045, May 21 - 24, 2006

IV.

RESULTS

A. RCS-2 Implementation To allow the first level of reconfiguration, defined in section III, and to verify the proposal of reconfigurable crossbar switch, the RCS-2 architecture was codified using VHDL. Beyond the portage of code, it was necessary to build a parameterized code to achieve the first level. To allow the second level of reconfiguration, the decoder module was codified. This module receives three sets of bits: configuration type, address, and data. The configuration type determines which kind of configuration must be made: node, line or column. When it is a node configuration type bits, the decoder fills the configuration bits of a connection matrix node indicated by address with the data bits. When it is line or column configuration type bits, the entire line or column (all nodes) indicated by address bits is filled by data bits. In this way, these types of reconfiguration can be made in one reconfiguration time, configuring in parallel all nodes. This time is defined by the technology of the target device. The code and the architecture were verified through behavioral simulation using the ModelSim XE II from Model Technology. The times of an eight-input RCS-2 are summarized on table I. TABLE I.

TIMES OF RCS-2 IMPLEMENTED IN XILINX FPGA SPARTAN 2 XC2S200-6FG256

RCS-2 time interval between startup and the instant it is ready to be used Node configuration of a connection matrix RCS-2 port latency Overhead for transmit 1500 Bytes

23,137 ns 8,780 ns 12,895 ns 6012,9 ns

These times can vary depending on the target device and connection matrix dimensions that are the number of input and output ports. They were used to compare reconfiguration and transmission times. In the next subsection these times were used to analyze some applications using RCS-2. B. Evaluating Broadcast Patterns on RCS-2 In this section we present and analyze experimental results regarding the prototype implementation of RCS-2. The following results were obtained implementing and executing different versions of the broadcast function (operation) on top of RCS-2. This type of function is commonly found in message-passing support middleware, which is used in HPC (High Performance Computing) applications. Each version of the broadcast function was based on a simple communication pattern (algorithm), and all patterns were chosen since they are simple to understand and to implement [11]. We highlight that it is also possible to execute other types of communication functions and more complex communication algorithms on top of RCS-2. The test environment used is an eight-node cluster of workstations interconnected by an implementation of RCS-2 in FPGA. Each node has a 938MHz Pentium III processor and 256MB of RAM memory. The RCS-2 times presented in table I were used. In order to perform the tests in that cluster,

a parallel application, that performs a broadcast operation using different communication patterns, was used. The application is composed of eight processes, which were dynamically mapped into the nodes of the cluster (one process per processor). The best and the worst cases of mapping between application processes and the cluster’s processing nodes were considered [9]. The communication patterns upon which the parallel application is implemented are shown in figure 4. Sequential is a centralized pattern in which the root process (coordinator) sends its packets to all other N-1 processes. In Chain, process R1 ∈ [0,...,N-2] sends its messages to R2=(R1+1)modN and R3 ∈ [1,...,N-1] receives a message from R4=(N+R3-1)modN. In Binary, a given process R ∈ [0,...,N-1] has two children: R1=(2R)+1 and R2=(2R)+2 (only if are both R1 and R2 smaller than N). In this pattern, a given process forwards the messages received from its parent and then sends its own message to its children. In Binomial, the group of processes is recursively subdivided in two, so that the messages recursively reach all N processes in log(N1)+1 steps. The packets transmitted in all these patterns have a 1460 Bytes payload and a 40 Bytes header.

(a)

(b)

(c)

(d)

Figure 4. The broadcast communication patterns that we used: (a) Sequential, (b) Binary, (c) Binomial and (d) Chain [11].

The broadcasting of message with different sizes was tested, ranging from 1 byte to 256KB. Since all experiments yielded a similar behavior in the network, only the results regarding 256KB messages are presented. We initially tested all the versions of the broadcast application in RCS-2, which topology was statically configured as shown in figure 5. The topologies were chosen since they are simple to implement, describe and analyze.

Figure 5. Implemented network topologies: (a) Ring, (b) Star, (c) Switch/Bus, (d) Full-connected.

The total response time of the parallel application is shown in figure 6. In all tests, RCS-2 was statically configured, so that R2NP implemented the desired topologies. This type of configuration simulates the network devices using traditional crossbar switches. Analogy the workload simulates the traditional broadcast operation implemented in regular communication middlewares. In this case, the reconfiguration overhead is worthless. Analyzing figure 6 it is possible to notice that Binomial presents the best results in most of the cases, except for the

In: IEEE International Symposium on Circuits and Systems, ISCAS 2006, Island of Kos, Greece, pp.4042-4045, May 21 - 24, 2006

Ring and Star topologies. We remark that RCS-2 can be reconfigured to perform the broadcast operation in 1.08ms. This result represents a speedup of 3 in comparison with the best communication pattern running on the network with the best topology. The result also indicates a mean speedup of 11.5 in comparison with all measured response times. Although many ASICs can reach similar performance results, they usually demand specific protocols and have very little flexibility.

Figure 6. Response time (ms) of the parallel (broadcast) application.

Suppose a workload composed of 100 calls to the broadcast function. Consider also that we use four different middlewares supporting message-passing. The processes of the application are mapped according to the worst case. One of them is based on the Chain pattern and used 70% of the calls. The other middlewares use the others 3 patterns described, so each pattern used 10% of calls.

Figure 7. Execution Time of the Workload on TCS and RCS-2

Figure 7 presents the execution time of the entire workload on top of traditional crossbar switches (TCS) and on top of RCS-2, including the reconfiguration time. The figure shows that even with the reconfiguration overhead incurred in the execution of the application, RCS-2 yielded a result that is very close to the best topology in that case (the Full-Connected topology). On the other hand RCS-2 also yields the best results, but the best topology is the Bus in this case. Therefore, RCS-2 presented the best performance due to its flexibility.

V.

CONCLUSIONS

The developed reconfigurable crossbar switch architecture presented advantages due to its flexibility and high performance. This fact justifies its employment in a network processor. The capability of adapting the topology implemented on crossbar switch to the environment changes generates high performance for data processing in several situations as multiprocessors and computer clusters. The contribution of this paper is the proposed RCS-2 architecture. The first level of reconfiguration of the RCS-2 could be reached through the codification of the architecture using a hardware description language, allowing it to be implemented in several devices with dimensions determined by device capacity. The second level of reconfiguration could be reached with modifications in the matrix of connections. These modifications generate an overhead. However, through the experiments, it was evidenced that the overhead time is less than the speedup obtained through the topologies implementation in RCS-2. Therefore, the RCS-2 has a better performance when compared to a TCS. As future work, intermediate cases will be analyzed, where the benefits of reconfiguration will be more intensely highlighted. REFERENCES [1]

D. E. Comer, “Network Systems Design using Network Processors”, Prentice Hall, 2003 [2] D. Kim, K. Lee, S. Lee and H. Yoo, “A Reconfigurable Crossbar Switch with Adaptive Bandwidth Control for Networks-on-Chip”, IEEE International Symposium on Circuits and Systems, 2005, pp. 2369 - 2372 [3] G. Lawton, “Will Network Processor Units Live up to Their Promise?”, IEEE Computer, Volume 37, Number 4, April, 2004, pp.13-15 [4] H. C. Freitas and C. A. P. S. Martins, “Didactic Architectures and Simulator for Network Processor Learning”, Workshop on Computer Architecture Education, San Diego, CA, USA, 2003, pp.86-95 [5] H. Eggers, P. Lysaght, H. Dick, and G. McGregor, “Fast Reconfigurable Crossbar Switching in FPGAs”, Proceedings of 6th International Workshop on Field-Programmable Logic and Applications, Springer LNCS 1142, 1996, pp.297-306 [6] I. A. Troxel, A. D. George and S. Oral, “Design and Analysis of a Dynamically Reconfigurable Network Processor”, IEEE Conference on Local Computer Networks, November 6-8, 2002, pp.483-494 [7] J. Chang, S. Ravi, and A. Raghunathan, “FLEXBAR: A crossbar switching fabric with improved performance and utilization”, IEEE Custom Integrated Circuits Conference, May 2002, pp.405-408 [8] K. Compton and S. Hauck, “Reconfigurable Computing: A Survey of Systems and Software”, ACM Computing Surveys, Vol. 34, No. 2, June 2002, pp. 171-210 [9] L.E.S. Ramos and C.A.P.S. Martins, “A Proposal of Reconfigurable MPI Collective Communication Functions”. Third International Symposium on Parallel and Distributed Processing and Applications, LNCS 3758, Nanjing, China, November 2-5, 2005, pp. 102-107 [10] S. Young, et al., “A High I/O Reconfigurable Crossbar Switch”, 11th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, Napa, Califormia, April 09-13, 2003, pp.3-10 [11] S.S. Vadhiyar, G.E. Fagg and J. Dongarra, “Automatically Tuned Collective Communications”, ACM/IEEE SuperComputing, 2000, P.3

Suggest Documents