2017 International Conference on Computational Intelligence in Data Science (ICCIDS)

SDN in High Performance Computing for Scientific and Business Environment (SBE)

Sabbi Vamshi Krishna
ECE Department, Godavati Institute of Engg & Tech., Rajamundhary, India
[email protected]

Dr. Azad Shrivastava
Head, HPC & OSPD Division, Aura Emanating Teknology Pvt. Ltd, New Delhi, India
[email protected]

Dr. Sunil J. Wagh
Vice Principal, MGM College of Engg & Tech., Noida, India
[email protected]

Abstract—In the current scenario, the need for computing is shifting towards High Performance Computing (HPC), not only for scientific applications but also for commodity applications. An HPC cluster system allows HPC users to analyze and solve complex problems associated with large data sizes, such as big-data simulation and large-scale scientific simulation, quickly; it is widely required in science and industry to enhance accuracy. Today the cluster approach is preferred over mainframe systems for HPC: a cluster is a system of multiple multi-core computers interconnected in a network. Apart from the interconnect, an HPC system comprises multiple standalone computers (PCs, workstations, or SMPs) built on processor technologies (e.g., Intel, IBM P6, Sun SPARC), RAM, and motherboard architectures, further enhanced by operating systems, middleware, and parallel programming environments. In this paper, key factors such as processor micro-architecture (Ivy Bridge, Haswell), inter-core communication performance, inter-socket communication performance (QuickPath Interconnect), and inter-rack communication over 10 Gbps Ethernet and InfiniBand DDR and FDR are reviewed, together with the challenges of improving them to meet business needs by incorporating Software Defined Networking (SDN) in the HPC scenario.

Index Terms—inter-core, inter-socket, QPI, inter-rack, InfiniBand DDR, FDR, SDN

978-1-5090-5595-1/17/$31.00 ©2017 IEEE

I. Introduction

In the present scenario, the need for computing is shifting towards High Performance Computing (HPC), not only for scientific applications but also for commodity applications, with individuals, system integrators, and Original Equipment Manufacturers (OEMs) as the immediate players. HPC supports researchers and developers in analyzing and solving complex, decision-critical problems associated with large data, such as scientific simulation over large data sizes and varied data types, in the realm of the Scientific and Business Environment [30],[32]. Such problems used to be solved on mainframe computers, but the data generated today by a variety of sources is too big to compute, and takes too long to yield results, on a commodity desktop computer or workstation, so HPC has become an imperative technology and technique for large-scale computation on big data. Today's HPC systems are integrated as cluster systems: a cluster is a system of multiple interconnected machines in a switched network. The interconnect is the key focus of development and research for enhancing the performance of an HPC cluster [31],[33].

Distributed and high-performance applications require high computational power and high-speed, high-bandwidth communication. In recent years, micro-architectural development has increased the computational power of microprocessors many fold, and, concomitantly, very low-latency and very high-bandwidth network interconnects have emerged. This is an encouraging trend for building high-performance clustered computing environments that combine the computational power of commodity personal computing devices (desktops, laptops, hand-held devices) with the communication performance of high-speed network interconnects. Interconnection technologies play a crucial role in improving price/performance [17],[38]. Network adapters and switches connect the nodes of an HPC cluster; at each node, a stack of communication-protocol layers provides functional compatibility for inter-node communication, while upper-layer protocols such as MPI and PVFS exploit the communication layer to serve user applications. One of the key challenges for a cluster designer is to select the interconnect that best matches the applications' communication requirements. Because applications can have very different communication characteristics, detailed knowledge of the different interconnects and of their performance under these communication patterns is required [37],[38].

II. Ivy Bridge Micro-Architectural Features

Compared with the preceding "tock" phase Sandy Bridge micro-architecture, the "tick" phase Ivy Bridge micro-architecture adds a few enhancements and is the first to use Intel's tri-gate transistors.

The Xeon E7 v2 family supports up to 15 computing cores per chip without an Integrated Voltage Regulator (IVR), and memory capacity is increased to 1.5 TB per socket, three times that of the previous Sandy Bridge generation. With the Xeon E7 v2 line, Intel also introduced Intel Run Sure Technology, aimed at enhancing the Reliability, Availability and Serviceability (RAS) of the Xeon platform [1],[6]. It delivers up to double the average performance of previous versions. The Ivy Bridge micro-architecture is broadly similar to Sandy Bridge, but adds a few micro-architectural enhancements such as a next-page prefetcher, zero-latency move operations, and improved ROR and ROL instructions [6],[7]. In addition, the front end is enhanced: if one logical processor is inactive, a single thread executing on that core can use 56 entries in the micro-operation queue, and the latency and throughput of some instructions, such as 256-bit floating-point divide and square root, have been improved. HPC performance benefits from expanding AVX floating-point operations from 128 bits to 256 bits, effectively doubling throughput, and from allowing Intel's Ethernet controllers and adapters to talk directly to the processor cache, reducing Ethernet-related latency [10].

The Ivy Bridge micro-architecture also improves the Machine Check Architecture (MCA), with features such as an MCA recovery execution path, MCA I/O, and PCI Express Live Error Recovery (LER). It extends software-assisted error recovery to uncorrectable data errors, provides uncorrected I/O error information to the OS, and enables recovery from PCI Express bus errors. In addition to the Intel QuickPath Interconnect (QPI) links, a big change in the Xeon E7 v2 is on-chip direct PCI Express (Gen 3): to connect each CPU to peripheral devices, the chips support 128 lanes of I/O, and Intel claims a 4x I/O bandwidth improvement. As a further micro-architectural improvement, new memory-controller configurations are introduced, with two Scalable Memory Interconnect (SMI) Gen 2 links per home agent/memory controller, for a total of four links per processor socket. The chip now supports up to 1.5 TB of RAM per socket, which means a 4-socket configuration reaches 6 TB and an 8-socket system 12 TB; this is a 3x improvement over the previous generation, enabled by supporting more DIMMs per socket (24 vs. 16) at greater capacity (64 GB vs. 32 GB) (see Table I). The CPU comes with three QPI links at speeds up to 8 GT/s. The QPI links are also used more efficiently through a home-snoop protocol, which reduces the number of transactions when a CPU asks for data that is neither in its own cache nor in local RAM, and hence provides better scalability at a minor increase in latency. The Xeon E7 v2 also finally brings PCI Express 3.0 to the enterprise server segment: each CPU provides 32 PCIe lanes that can be flexibly configured, which improves the I/O capabilities tremendously. Apart from these brute-force improvements, Intel also claims reduced latencies and improved direct PCIe-to-PCIe bandwidth (see Table I).

Another four PCIe 2.0 lanes are available for the Direct Media Interface (DMI) that connects the CPU to the chipset. The memory interface includes two on-chip DDR3 memory controllers, each with two memory channels supporting effective frequencies from 800 MT/s to 1867 MT/s for traditional DDR3 modules, together with speeds of up to 2667 MT/s when connecting to a memory-extension buffer over the voltage-mode single-ended (VMSE) interface; it consequently supports multiple system topologies. The Ivytown version of the processor's high-speed serial I/O supports up to 40 lanes of PCI Express (2.5/5.0/8.0 Gbps), four lanes of DMI (2.5/5.0 Gbps), and 60 lanes of QPI (6.4/7.2/8.0 Gbps) to connect with other CPUs.
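The wider AVX floating-point path and the FMA support discussed in this and the next section are exposed to software through CPUID. As a small illustration (not taken from the paper), run-time detection can be sketched with the GCC/Clang built-in feature tests:

```c
/* Illustrative sketch: detecting the SIMD features discussed in Sections II-III
 * at run time. __builtin_cpu_supports() is a GCC/Clang built-in backed by CPUID. */
#include <stdio.h>

int main(void) {
    __builtin_cpu_init();  /* initialise the CPU-feature cache */
    printf("AVX  : %s\n", __builtin_cpu_supports("avx")  ? "yes" : "no");
    printf("AVX2 : %s\n", __builtin_cpu_supports("avx2") ? "yes" : "no");
    printf("FMA  : %s\n", __builtin_cpu_supports("fma")  ? "yes" : "no");
    return 0;
}
```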

Fig. 1: Ivy Bridge micro-architecture (block diagram of the instruction fetch unit and decoders, 1.5K-entry µ-op cache, 168-entry reorder buffer, 54-entry unified scheduler, execution ports 0-5, load/store buffers, and the 32 KB L1 and 256 KB L2 caches).

III. Haswell Micro-Architectural Features

Compared with the Ivy Bridge micro-architecture, the Haswell micro-architecture (Figure 2) is Intel's "tock" phase development. Intel has made several micro-architectural enhancements in both the core and the uncore design. The execution engine and out-of-order resources have been enlarged considerably to support the extraction of a greater degree of instruction-level parallelism (Table I). To provide enough data to the execution units, not only the L1D bandwidth but also the L2 cache bandwidth has been doubled (from 32 bytes/cycle per core to 64 bytes/cycle per core). Moreover, the on-die Integrated Memory Controller (IMC) of the Haswell-EX (Xeon E7 v3) line supports DDR4, which increases memory bandwidth, shows about 2% improved latency compared with DDR3, and offers greater power savings in HPC clusters. The 18-core die is designed around two bidirectional rings, the first connecting 8 cores and the second connecting the other 10 cores [3],[9]. Each ring partition is controlled by an integrated memory controller with two memory channels, and the rings exchange data via queues between the partitions.

AVX2 (256-bit integer SIMD instructions) is an enhanced extension of the Instruction Set Architecture compared with the original AVX, which largely covered floating-point instructions. With some micro-architectural enhancements, especially for 128-bit operation, AVX2 brings integer SIMD to the full YMM registers, and it introduces 16 new gather instructions that can fetch 4 or 8 non-contiguous data elements using special vector-addressing capabilities for both integer and floating-point (FP) SIMD. In addition to gathering non-contiguous data elements, Intel introduced Fused Multiply-Add (FMA), which comprises 36 FP instructions for 256-bit computation and 60 instructions for 128-bit vectors, in contrast to the original four-operand FMA proposal. FMA and AVX2 remarkably increase the theoretical peak performance [2],[13] by issuing two AVX or FMA operations per cycle.

The major improvement of the Intel Haswell-EX processors is the on-chip Fully Integrated Voltage Regulator (FIVR) [12]. With this integration, much finer control of the power states is possible, with reduced on-die reaction latency. Input-voltage control is done by sending serial voltage ID (SVID) signals to the Main Board Voltage Regulator (MBVR), which then regulates VCCin accordingly; based on the estimated power consumption, the processor activates one of three power states with the support of the MBVR [12]. The Energy Performance Bias (EPB) feature of the Intel processor influences operating-frequency selection and is configured in the BIOS; the EPB setting can also be changed by writing a Model-Specific Register (MSR). Although the 4-bit field allows 16 possible settings, values of 0, 6, and 15 are used for performance, balanced, and energy saving, respectively [14],[15],[16]. The FIVR section of the Haswell micro-architecture provides an individual voltage for every core, and the energy-aware runtime feature lowers a single core's power consumption while keeping the other cores' performance at the maximum level. The uncore frequency markedly influences on-die cache-line transfer rates as well as memory bandwidth, and it depends heavily on core stall cycles. Whereas Ivy Bridge uses a common frequency for the core and uncore sections, Haswell introduces Uncore Frequency Scaling (UFS), which controls the frequency of the uncore components independently of the cores.
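The doubled floating-point throughput attributed above to AVX2 and FMA can be illustrated with a short loop written against the 256-bit FMA intrinsics. This is a sketch of the programming model only, not code from the paper; it assumes a compiler with AVX2/FMA support (e.g., gcc -O2 -mavx2 -mfma):

```c
/* Sketch: fused multiply-add over double-precision arrays using 256-bit
 * AVX2/FMA intrinsics, i.e. y[i] = a * x[i] + y[i], four elements at a time. */
#include <immintrin.h>
#include <stdio.h>

static void fma_axpy(double a, const double *x, double *y, size_t n) {
    __m256d va = _mm256_set1_pd(a);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m256d vx = _mm256_loadu_pd(&x[i]);
        __m256d vy = _mm256_loadu_pd(&y[i]);
        vy = _mm256_fmadd_pd(vx, va, vy);   /* multiply and add in one instruction */
        _mm256_storeu_pd(&y[i], vy);
    }
    for (; i < n; ++i)                      /* scalar tail */
        y[i] = a * x[i] + y[i];
}

int main(void) {
    double x[8] = {1, 2, 3, 4, 5, 6, 7, 8}, y[8] = {0};
    fma_axpy(2.0, x, y, 8);
    printf("y[7] = %.1f\n", y[7]);          /* expect 16.0 */
    return 0;
}
```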

Fig. 2: Haswell micro-architecture (block diagram of the front end, 1.5K-entry µ-op cache, 192-entry reorder buffer, 60-entry unified scheduler, execution ports 0-7, load/store buffers, and the 32 KB L1 and 256 KB L2 caches).

The energy-efficient turbo (EET) feature of Haswell attempts to reduce the use of turbo frequencies when they do not significantly increase performance. EET continuously monitors the EPB setting as well as the number of stall cycles; however, the monitoring mechanism samples stall data only intermittently, so EET may hurt the efficiency and performance of workloads whose characteristics change at an unfavorable rate. The Haswell core architecture has a considerably larger out-of-order section than Ivy Bridge, with a remarkable increase in dispatch ports and execution resources [2]. Together with the ISA extensions, the integer operations per core and the theoretical FLOPs have nearly doubled. Most significantly, the bandwidth of the cache hierarchy, including the L1D and L2, has been doubled, easing bandwidth bottlenecks. Compared with earlier generations, the Haswell micro-architecture offers up to 4x the peak FLOPs, 3x the cache bandwidth, and nearly 2x the reordering window, along with improved virtualization: Haswell adds access to dirty bits in the extended page tables, which reduces virtual-machine-related transitions; the newly introduced VMFUNC instruction allows VMs to invoke hypervisor functions without exiting, improving the round-trip latency of VM transitions to below 500 cycles; and the I/O virtualization page tables support a full 4-level structure. Overall, the Haswell micro-architecture is estimated to offer around 10% to 15% better performance than Ivy Bridge [5] (see Table I).
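As a hedged illustration of the EPB mechanism described above, the IA32_ENERGY_PERF_BIAS MSR (address 0x1B0) can be read on Linux through the msr driver. This sketch assumes the msr kernel module is loaded and the program runs with root privileges; it is not a procedure given in the paper:

```c
/* Sketch: read the Energy Performance Bias (EPB) value for CPU 0 via the
 * Linux msr device. The msr device is addressed by MSR number through the
 * file offset; 0x1B0 is IA32_ENERGY_PERF_BIAS. */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

int main(void) {
    int fd = open("/dev/cpu/0/msr", O_RDONLY);
    if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }

    uint64_t epb;
    if (pread(fd, &epb, sizeof(epb), 0x1B0) != (ssize_t)sizeof(epb)) {
        perror("pread");
        close(fd);
        return 1;
    }
    close(fd);

    /* Only the low 4 bits are defined: 0 = performance, 6 = balanced, 15 = energy saving. */
    printf("EPB on CPU 0: %llu\n", (unsigned long long)(epb & 0xF));
    return 0;
}
```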

TABLE I
MICRO-ARCHITECTURAL COMPARISON OF IVY BRIDGE (E7-8890 v2) AND HASWELL (E7-8890 v3) [3],[6],[7],[9]

Feature | Ivy Bridge (E7 v2) | Haswell (E7 v3)
Decode | 4(+1)/cycle | 4(+1)/cycle
Number of cores | 15 | 18
Allocation queue | 28/56 | 56
Execute function size | 6 micro-ops/cycle | 8 micro-ops/cycle
Retire function size | 4 micro-ops/cycle | 4 micro-ops/cycle
Scheduler entries | 54 | 60
ROB entries | 168 | 192
Integer/floating register file | 160/144 | 168/168
SIMD ISA | AVX | AVX2
FLOPs/cycle | 8 DP | 16 DP
Load/store buffer entries | 64/36 | 72/42
DRAM bandwidth | 51.2 GB/s | 68.2 GB/s
QPI speed | 8 GT/s | 9.6 GT/s
L3 cache latency | 15.5 ns | 12.14 ns
L3 cache per core | 2.5 MB | 2.5 MB
L2 cache per core | 256 KB | 256 KB
L1 data cache per core | 32 KB | 32 KB
L1 instruction cache | 32 KB | 32 KB
Line fill buffer (LFB) | 10 entries | 10 entries
Reservation station | 56 entries | 60 entries
Number of execution ports | 6 | 8
RAM type supported | DDR3 (1600 MHz) | DDR4 (1866 MHz)

TABLE II
EXPECTED CACHE HIERARCHY LATENCY AND BANDWIDTH [1],[18],[20]

Cache level | Parameter | Ivy Bridge | Haswell
L1 data cache | Latency | 4 cycles | 4 cycles
L1 data cache | Bandwidth | 2 x 32 B/cycle per core | 2 x 64 B/cycle per core
L2 (unified) | Latency | 12 cycles | 11 cycles
L2 (unified) | Bandwidth | 1 x 32 B/cycle per core | 1 x 64 B/cycle per core
L3 (LLC) | Latency | 31-36 cycles | 34 cycles
L3 (LLC) | Bandwidth | 1 x 32 B/cycle per core | 1 x 64 B/cycle per core
RAM | Latency | 30 cycles + 53 ns | 36 cycles + 57 ns
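Load-to-use latencies such as those in Table II (and Table III below) are typically measured with a pointer-chasing loop whose working set is sized to the cache level of interest. The following is a simplified stand-in for the methodology of [1],[34], with an assumed working-set size and stride, not the authors' benchmark:

```c
/* Sketch: pointer-chasing latency measurement. Each load depends on the
 * previous one, so elapsed time / number of loads approximates the
 * load-to-use latency of the cache level holding the working set
 * (16 KB here, which fits in the 32 KB L1D; enlarge to target L2/L3). */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define WORKING_SET (16 * 1024)                 /* bytes */
#define NODES (WORKING_SET / sizeof(void *))
#define ITERS (100UL * 1000 * 1000)

int main(void) {
    void **ring = malloc(NODES * sizeof(void *));
    if (!ring) return 1;

    /* Build a ring of pointers with a cache-line-sized (64 B) stride.
     * A real benchmark randomizes the chain to defeat the hardware prefetcher. */
    size_t stride = 64 / sizeof(void *);
    for (size_t i = 0; i < NODES; i++)
        ring[i] = &ring[(i + stride) % NODES];

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    void **p = ring;
    for (unsigned long i = 0; i < ITERS; i++)
        p = (void **)*p;                        /* dependent load chain */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("average load latency: %.2f ns (p=%p)\n", ns / ITERS, (void *)p);
    free(ring);
    return 0;
}
```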

IV. HPC Interconnects

Interconnect technology, the medium through which the cluster nodes connect and communicate, is perhaps the single most important factor in an HPC cluster environment and is of particular interest to network professionals. The speed and bandwidth of the interconnect can limit top-end performance and also determine scalability.

TABLE III
INTER-CACHE MEMORY LATENCY PER CACHE LINE (NS) [1],[34]

Transfer | Access pattern | Ivy Bridge (ns) | Haswell (ns)
L2 to L1 | Parallel random read | ~2.5 | ~2.3
L2 to L1 | Read (64-byte step) | ~2.2 | ~2.2
L3 to L1 | Parallel random read | ~4.8 | ~5
L3 to L1 | Read (64-byte step) | ~4.9 | ~4.9

TABLE IV
INTEGER SPECINT_RATE_BASE2006 BENCHMARK PERFORMANCE (8-SOCKET BOARD) [3],[4]

Processor | Micro-architecture | SPECint_rate_base2006
E7-8890 v2 | Ivy Bridge | 4570
E7-8890 v3 | Haswell | 5470

TABLE V
INTEGER SPECINT_RATE2006 BENCHMARK PERFORMANCE (8-SOCKET BOARD) [3],[4]

Processor | Micro-architecture | SPECint_rate2006
E7-8890 v2 | Ivy Bridge | 4710
E7-8890 v3 | Haswell | 5630
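The FLOPs/cycle row of Table I translates directly into theoretical peak throughput. The short calculation below uses assumed nominal clock frequencies of 2.8 GHz and 2.5 GHz for the two parts; these frequencies are illustrative values, not figures taken from the paper:

```c
/* Sketch: theoretical peak DP FLOP/s = cores x clock x FLOPs/cycle,
 * using the per-core FLOPs/cycle from Table I and assumed nominal clocks. */
#include <stdio.h>

int main(void) {
    double ivb = 15 * 2.8e9 * 8;    /* Ivy Bridge E7 v2: 15 cores, 8 DP FLOPs/cycle */
    double hsw = 18 * 2.5e9 * 16;   /* Haswell E7 v3: 18 cores, 16 DP FLOPs/cycle */
    printf("Ivy Bridge peak: %.0f GFLOP/s\n", ivb / 1e9);   /* 336 */
    printf("Haswell peak:    %.0f GFLOP/s\n", hsw / 1e9);   /* 720 */
    return 0;
}
```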

In traditional HPC clusters, a key question is the type of application, fine-grained or coarse-grained, and the answer helps in deciding the cluster's requirements. In a parallel computation, independent processes perform their respective computations on individual nodes and then synchronize and coordinate their results by passing messages over the network [36]. Processing bottlenecks in traditional protocols are generally caused by data buffering, data-integrity checking, and routing algorithms, and may also arise from hardware interrupts on packet arrival and transmission and from software signaling between layers to handle protocol processing at different priority levels. Bottlenecks arise not only from protocol processing but also from the I/O interface and the network itself. Coarse-grained applications spend more time computing than communicating, whereas fine-grained applications spend more time communicating among compute nodes than computing. High-speed Ethernet and InfiniBand were introduced into the market to address these bottlenecks: InfiniBand targets all three (protocol processing, I/O bus, and network speed), whereas Ethernet targets the network-speed bottleneck, relying on the I/O bus and on complementary technologies to alleviate protocol processing [37].

The most common interconnects in use today are Gigabit Ethernet (10 GigE, 40 GigE, and 100 GigE) and InfiniBand (DDR, FDR, and EDR). Gigabit Ethernet is gaining popularity because it not only offers good performance for many applications but is also inexpensive: most server motherboards already have at least two GigE NICs, and most desktops have at least one. 10 GigE has 5-6 times the latency of InfiniBand, and InfiniBand supports 3.7 times the throughput of 10 GigE. The excitement behind 10 GigE is that it is TCP/IP based, and virtually all network designers understand and are familiar with TCP. TCP does not guarantee in-order delivery of packets and simply retransmits a missing packet, whereas InfiniBand guarantees in-order packet delivery with no dropped packets; InfiniBand was designed specifically to meet the needs of HPC cluster networks that require guaranteed packet delivery. The performance gap between the Ethernet and InfiniBand options has been virtually closed with the availability of 40 Gb Ethernet (see Table IX).

TABLE VI
FLOATING POINT SPECFP_RATE_BASE2006 BENCHMARK PERFORMANCE (8-SOCKET BOARD) [3],[4]

Processor | Micro-architecture | SPECfp_rate_base2006
E7-8890 v2 | Ivy Bridge | 3240
E7-8890 v3 | Haswell | 3850

TABLE VII
FLOATING POINT SPECFP_RATE2006 BENCHMARK PERFORMANCE (8-SOCKET BOARD) [3],[4]

Processor | Micro-architecture | SPECfp_rate2006
E7-8890 v2 | Ivy Bridge | 3310
E7-8890 v3 | Haswell | 3910

TABLE VIII
FOUR-SOCKET SPECVIRT_SC2013 BENCHMARK PERFORMANCE [5]

Processor | Micro-architecture | SPECvirt_sc2013 | VMs
E7-8890 v2 | Ivy Bridge | 2086 | 121
E7-8890 v3 | Haswell | 2655 | 147

TABLE IX
INTERCONNECT PERFORMANCE [36]

Interconnect | Latency (microseconds) | Bandwidth (MB/s)
GigE | 47.16 | 112
10 GigE | 12.51 | 875
40 GigE | 4 | 5000
SDR InfiniBand | 2.6 | 938
DDR InfiniBand | 2.25 | 1502
QDR InfiniBand | 1.67 | 3230
FDR InfiniBand | 0.7 | 6800
EDR InfiniBand | 0.5 | 12000

Fig. 3: Traditional network structure (each switch bundles its own network operating system, feature set, and packet forwarding).
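Latency and bandwidth figures such as those in Table IX are conventionally measured with a ping-pong microbenchmark between two nodes. A minimal MPI sketch is shown below; it assumes an MPI implementation such as MPICH or Open MPI and is not taken from the cited white paper [36]:

```c
/* Sketch: MPI ping-pong between ranks 0 and 1. Half the round-trip time gives
 * the one-way latency for the chosen message size; large messages give bandwidth. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int msg_size = 8;          /* bytes; raise to ~1 MB to measure bandwidth */
    const int iters = 10000;
    char *buf = calloc(msg_size, 1);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("one-way latency: %.2f us\n", (t1 - t0) / (2.0 * iters) * 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}
```

Run with two ranks placed on different nodes (e.g., mpirun -np 2 across two hosts) so that the measured path crosses the interconnect rather than shared memory.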

V. Software Defined Networking

The network topologies used in large HPC cluster systems are often complex and involve multiple connections between switches and servers, along with protocol-compatibility and motherboard-architecture considerations [19]. One source of complexity is the need to sustain high bandwidth availability and low latency while reducing congestion in the cluster. Although core interconnection schemes are efficient for homogeneous clusters, some irregularity remains when heterogeneous processors are connected, which makes the overall HPC cluster system harder to model. In addition, link failure and congestion are common issues in large HPC cluster networks. Large clusters employ legacy switches, possibly at multiple levels, to connect the computing nodes (Figure 3) [28],[29],[30]. Traditional HPC networks are IP-configured networks [20],[21],[26]; the configuration dictates the topology, so configuration complexity is added to topological complexity. Furthermore, the control plane and data plane are bundled inside the networking devices, which reduces flexibility. In traditional HPC networking, switching and routing are based on the IP and MAC headers, so there is a need to control switching and to minimize routing and convergence overhead. Each router and switch in the HPC network computes the shortest route to every reachable compute node and builds a path table that controls the forwarding of each IP packet to its next hop [22],[23],[24].

Fig. 4: SDN-based network structure (a centralized network operating system hosts the features; the switches retain only packet forwarding).

Another important issue is that, in traditional HPC networking, routers and switches are agnostic to the applications being served by the servers, so bandwidth is not utilized optimally (Figure 3). Static network configurations are used to meet applications' bandwidth and latency requirements. Network operators are responsible for configuring policies that respond to a wide range of network events and applications, and the transformation of these high-level policies into low-level configuration commands is done manually while adapting to changing network conditions [35],[36]. This is a challenging and error-prone task.

Fig. 5: SDN-based HPC architecture (HPC client applications and an HPC optimizer/application in the control plane use network state/topology awareness to program SDN switches connecting the HPC nodes in the data plane).

In the traditional HPC cluster network scenario, decisions can only be made on local information, which makes the distribution of global resources inefficient. Because global information is missing at the application layer, users have no way to know the optimal point in time to submit a job; on the other side, no mechanism is available for announcing users' requirements so that resources can be planned in advance for the upcoming traffic. Hybrid network architectures, heterogeneous infrastructures, and multi-vendor environments, together with proprietary-protocol issues, steadily increase the complexity of networking, data mining, and computing. To overcome such challenges, the idea of programmable networks, Software Defined Networking (SDN), has been proposed (Figure 4); the architectural difference between a traditional network and an SDN network is illustrated by Figures 3 and 4, respectively. In SDN, the forwarding hardware is decoupled from control decisions, which are logically centralized in a software-based controller. An SDN-based HPC solution (Figure 5) is needed to obtain higher throughput and better bandwidth utilization and so meet today's scientific and business computation needs. Table X summarizes the differences, and a conceptual sketch of the resulting match-action separation follows the table.

TABLE X
DIFFERENCES BETWEEN TRADITIONAL AND SDN NETWORKS

Traditional networking | Software Defined Networking
Distributed control plane | Logically centralized control plane
Protocol-based, configurable network | API-based, programmable network
Not suited to new business ventures | Supports new business ventures
Little agility and flexibility | Good agility and flexibility
Difficult to adapt to changing needs | Adapts to changing business needs
Network scaling is unsustainable | Easily scalable
Real-time policy changes are difficult | Real-time policy changes are possible
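The decoupling summarized in Table X ultimately means that a switch keeps only a match-action table that the controller populates. The toy structure below sketches that idea in a protocol-agnostic way; the field names and the install/lookup functions are invented for illustration and do not correspond to the OpenFlow API:

```c
/* Toy sketch of an SDN-style match-action flow table: the controller installs
 * rules, the data plane only matches packets against them and applies actions. */
#include <stdio.h>
#include <stdint.h>

#define MAX_RULES 64
#define ACT_DROP  0xFFFF                /* sentinel action: drop the packet */

struct flow_rule {                      /* hypothetical match fields */
    uint16_t in_port;                   /* 0 = wildcard */
    uint32_t dst_ip;                    /* 0 = wildcard */
    uint16_t out_port;                  /* action: forward to this port */
};

static struct flow_rule table[MAX_RULES];
static int n_rules;

/* "Controller" side: push a rule into the switch's flow table. */
static void install_rule(uint16_t in_port, uint32_t dst_ip, uint16_t out_port) {
    if (n_rules < MAX_RULES)
        table[n_rules++] = (struct flow_rule){ in_port, dst_ip, out_port };
}

/* "Data plane" side: the first matching rule decides the action. */
static uint16_t lookup(uint16_t in_port, uint32_t dst_ip) {
    for (int i = 0; i < n_rules; i++)
        if ((table[i].in_port == 0 || table[i].in_port == in_port) &&
            (table[i].dst_ip  == 0 || table[i].dst_ip  == dst_ip))
            return table[i].out_port;
    return ACT_DROP;                    /* table miss: drop (or punt to controller) */
}

int main(void) {
    install_rule(1, 0x0A000002, 7);     /* 10.0.0.2 arriving on port 1 -> port 7 */
    install_rule(0, 0x0A000003, 9);     /* 10.0.0.3 from any port       -> port 9 */
    printf("pkt(port 1, 10.0.0.2) -> out %u\n", lookup(1, 0x0A000002));
    printf("pkt(port 3, 10.0.0.9) -> out %#x\n", lookup(3, 0x0A000009));
    return 0;
}
```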

VI. Conclusion

The demand for High Performance Computing (HPC) has increased in the Scientific and Business Environment (SBE). HPC systems are built as cluster systems, and these clusters are connected through high-speed interconnects. There has been good progress in Ethernet and InfiniBand high-speed interconnect solutions in terms of bandwidth. Yet even with increased bandwidth, bandwidth utilization is limited to 60% to 80%, and correspondingly the throughput of the interconnection network ranges from 45% to 60%, depending on the application being run on the HPC cluster. On the other side, heterogeneous infrastructures, multi-vendor environments, and hybrid network architectures steadily increase the complexity of protocol stacks, computing, data mining, and network compatibility. Incorporating SDN, with its logically centralized control plane, into the HPC cluster network is therefore a promising way to address these challenges and improve bandwidth utilization.

REFERENCES

[1] Intel 64 and IA-32 Architectures Optimization Reference Manual, Intel, Jan. 2016, Order Number 248966-032.
[2] D. Hackenberg, R. Schöne, T. Ilsche, D. Molka, J. Schuchart, and R. Geyer, "An energy efficiency feature survey of the Intel Haswell processor," in Parallel and Distributed Processing Symposium Workshop (IPDPSW), 2015 IEEE International, 2015.
[3] Intel Xeon Processor E7 v3 Family Technical Compute Benchmarks, Intel, 2015.
[4] Intel Xeon Processor E7 v3 Family General Compute Benchmarks, Intel, 2015.
[5] Intel Xeon Processor E7 v3 Family Virtualization Benchmarks, Intel, 2015.
[6] Intel Xeon Processor E7-8800/4800/2800 v2 Product Family, Datasheet Volume 1, Intel, Feb. 2014, Reference Number 329594-001.
[7] Intel Xeon Processor E7-8800/4800/2800 v2 Product Family, Datasheet Volume 2, Intel, Feb. 2014, Reference Number 329594-002.
[8] Intel Xeon Processor E7-8800/4800/2800 v3 Product Family, Datasheet Volume 1, Intel, May 2014, Reference Number 332314-001US.
[9] Intel Xeon Processor E7-8800/4800/2800 v3 Product Family, Datasheet Volume 2, Intel, May 2014, Reference Number 332314-002US.
[10] I. Esmer and S. Kottapalli, "Ivybridge server architecture: A converged server," presented at Intel Hot Chips, 2014.
[11] Intel Xeon Processor E5 v3 Family Uncore Performance Monitoring Reference Manual, Intel, 2014.
[12] E. Burton, G. Schrom, F. Paillet, J. Douglas, W. Lambert, K. Radhakrishnan, and M. Hill, "FIVR – Fully integrated voltage regulators on 4th generation Intel Core SoCs," in Applied Power Electronics Conference and Exposition (APEC), 2014 Twenty-Ninth Annual IEEE, Mar. 2014, pp. 432–439.
[13] Intel 64 and IA-32 Architectures Optimization Reference Manual, Intel, Sep. 2014, Order Number 248966-030.
[14] N. Kurd, M. Chowdhury, E. Burton, T. Thomas, C. Mozak, B. Boswell, M. Lal, A. Deval, J. Douglas, M. Elassal, A. Nalamalpu, T. Wilson, M. Merten, S. Chennupaty, W. Gomes, and R. Kumar, "Haswell: A family of IA 22nm processors," in Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2014 IEEE International, Feb. 2014, pp. 112–113.
[15] S. Rusu, H. Muljono, D. Ayers, S. Tam, W. Chen, A. Martin, S. Li, S. Vora, R. Varada, and E. Wang, "5.4 Ivytown: A 22nm 15-core enterprise Xeon processor family," in Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2014 IEEE International, Feb. 2014, pp. 102–103.
[16] Intel 64 and IA-32 Architectures Software Developer's Manual, Volumes 3A, 3B, and 3C: System Programming Guide, Intel, Sep. 2014.
[17] G. Lento, "Optimizing performance with Intel Advanced Vector Extensions," online, Sep. 2014.
[18] R. Schöne, D. Molka, and M. Werner, "Wake-up latencies for processor idle states on current x86 processors," Computer Science - Research and Development, 2014.
[19] B. Goglin, J. Hursey, and J. M. Squyres, "netloc: Towards a comprehensive view of the HPC system topology," in Fifth International Workshop on Parallel Software Tools and Tool Infrastructures (PSTI 2014), Minneapolis, MN, USA, Sep. 2014.
[20] P. Polezhaev, A. Shukhman, and Y. Ushakov, "Network resources control system for HPC based on SDN," in NEW2AN/ruSMART 2014, LNCS 8638, pp. 219–230, 2014.
[21] A. Lara, A. Kolasani, and B. Ramamurthy, "Network innovation using OpenFlow: A survey," IEEE Communications Surveys & Tutorials, vol. 16, no. 1, pp. 493–512, First Quarter 2014.
[22] B. Nunes, M. Mendonca, X.-N. Nguyen, K. Obraczka, and T. Turletti, "A survey of software-defined networking: Past, present and future of programmable networks," IEEE Communications Surveys & Tutorials, vol. 16, no. 3, pp. 1617–1634, Third Quarter 2014.
[23] M. Casado, N. Foster, and A. Guha, "Abstractions for software-defined networks," Communications of the ACM, vol. 57, no. 10, pp. 86–95, Sep. 2014.
[24] C. E. Rothenberg, R. Chua, J. Bailey, M. Winter, C. Correa, S. Lucena, and M. Salvador, "When open source meets network control planes," IEEE Computer, Special Issue on Software-Defined Networking, Nov. 2014.
[25] F. Giroire, J. Moulierac, and T. K. Phan, "Optimizing rule placement in software-defined networks for energy-aware routing," in IEEE GLOBECOM, Austin, TX, USA, Dec. 2014.
[26] G. Andy, G. B. Reinhard, M. Yvonne, S. Rolf, and Hubert, "An integrated SDN architecture for application driven networking," IJASM, vol. 1 & 2, 2014.
[27] K. Phemius and M. Bouet, "OpenFlow: Why latency does matter," in Integrated Network Management (IM 2013), 2013 IFIP/IEEE International Symposium on, 2013, pp. 680–683.
[28] D. Hackenberg, R. Oldenburg, D. Molka, and R. Schöne, "Introducing FIRESTARTER: A processor stress test utility," in International Green Computing Conference (IGCC), 2013.
[29] S. Latifi, A. Durresi, and B. Cico, "Separating network control from routers with Software Defined Networking," in BCI '13, Thessaloniki, Greece, Sep. 19–21, 2013.
[30] H. Kim and N. Feamster, "Improving network management with software defined networking," IEEE Communications Magazine, vol. 51, no. 2, pp. 114–119, Feb. 2013.
[31] S. H. Yeganeh, A. Tootoonchian, and Y. Ganjali, "On the scalability of software-defined networking," IEEE Communications Magazine, Feb. 2013.
[32] H. Kim and N. Feamster, "Improving network management with software defined networking," IEEE Communications Magazine, Feb. 2013.
[33] A. Tootoonchian, S. Gorbunov, Y. Ganjali, M. Casado, and R. Sherwood, "On controller performance in software-defined networks," in Proceedings of the 2nd USENIX Conference on Hot Topics in Management of Internet, Cloud, and Enterprise Networks and Services (Hot-ICE '12), Berkeley, CA, USA, 2012, pp. 10–10.
[34] R. Schöne, D. Hackenberg, and D. Molka, "Memory performance at reduced CPU clock speeds: An analysis of current x86_64 processors," in Proceedings of the 2012 USENIX Conference on Power-Aware Computing and Systems (HotPower '12), Berkeley, CA, USA, 2012, pp. 9–9.
[35] A. P. Bianzino, C. Chaudet, D. Rossi, and J. Rougier, "A survey of green networking research," IEEE Communication Surveys and Tutorials, vol. 14, pp. 3–20, 2012.
[36] "Interconnect analysis: 10GigE and InfiniBand in high performance computing," white paper.
[37] Ali and R. Rajagopalan, "HPC cluster interconnect," reprinted from Dell Power Solutions, Oct. 2004, Dell Inc.
[38] J. Aweya, "IP router architectures: An overview," Journal of Systems Architecture, vol. 46, pp. 483–511, 2000.
