A fast multi language parallel computing middleware Design, implementation and applications M.O.Bensalah 1, Bouchaib Cherradi 2
Mohamed Youssfi, Omar Bouattane Dept. Math Info, ENSET Université Hassan II Mohammedia Casablanca Mohammedia, Morocco
1 Faculté des sciences de Rabat, Université Mohammed V Rabat Agdal
2 FST Mohammedia, Université Hassan II Mohammedia Casablanca
Abstract: In this paper we present the design of a parallel distributed computing middleware for high performance computing. The middleware lets developers build their own applications, which are intended to run on a grid of computers linked by a local or remote network. The software infrastructure of each grid node is based mainly on a middleware layer, for data broadcasting and task distribution, and an application server that deploys and launches parallel programs from any node of the grid. Parallel applications based on this middleware can be implemented in different languages; in this paper, we present its Java implementation. To validate the performance of the middleware, the paper presents two examples, parallel filtering and parallel c-means segmentation, using program code and flow charts.
Keywords: Middleware, Parallel computing, Distributed computing, Grid computing, Cloud computing, Parallel virtual machine, Parallel image processing, Parallel image segmentation, Parallel components contour detection
I.
INTRODUCTION
Recently, in the data analysis and signal processing domain, analysis tools, computation methods and their technological computational models have progressed considerably. This progress has oriented scientists toward new computation strategies based on parallel approaches. Because of the large volume of data to be processed and the large amount of computation needed to solve a given problem, the basic idea is to split tasks and data so that the corresponding algorithms can be run concurrently on different physical computational units. Currently, we distinguish several computer architectures, ranging from the single processor model to grids of massively parallel fine grained machines with a large number of processing elements interconnected according to several network topologies. The need for new architectures and the improvement of processor efficiency have been stimulated and encouraged by VLSI development. The parallel interconnection [1] of fine grained networks. In this paper, our study focuses on both a fine grained massively parallel architecture and coarse grained parallel architectures, which have been widely studied in the literature and for which several parallel algorithms for scientific computing were developed. At first, such an architecture was viewed as a simple grid of cellular automata; after some technological enhancements, the cellular automaton became a fine grained processing element and the resulting grid became the reconfigurable architecture [1, 2, 3, 4, 5]. Because of the high cost of parallel machines, emulation solutions were proposed in the literature to build virtual systems [6]. The emulated systems may be specific, as in [10, 11], or general purpose, as in [10, 8].
In order to integrate these emulators into real computing grids, it is necessary to use middleware that manages the complexity of distributed systems and simplifies the development process [4]. As defined in [4], “Middleware is a class of software technologies designed to help manage the complexity and heterogeneity inherent in distributed systems.”
In this paper we present the design of a parallel distributed computing middleware for high performance computing systems. This middleware lets developers create their own applications. These applications are intended to run on a grid of computers linked by a LAN, a VPN, the Internet, etc. The software infrastructure of each grid node is based mainly on a middleware layer, for data broadcasting and task distribution, and an application server that deploys and launches parallel programs from any node of the grid.
In the parallel image processing domain, we deploy the resulting grid to run some proposed parallel algorithms, such as the parallel c-means clustering algorithm. In [12, 13, 14, 15, 16] the authors demonstrate the effectiveness of this parallel algorithm and show how its complexity can be reduced in parallel computational models. In this paper, we present a parallel algorithm for MRI cerebral image filtering using the Sobel operator. A parallel c-means classification algorithm is also implemented on our grid to show how the corresponding program is subdivided into parallel tasks that are performed simultaneously over the grid nodes. This paper is organized as follows: Section 2 presents the computational model used to implement our parallel algorithms. The parallel filtering and c-means segmentation algorithms are presented in Section 3 using program code and flow charts. Finally, the last section gives some concluding remarks on this work.
II.
GRID BASED MIDDLEWARE ARCHITECTURE
The proposed grid, shown in Figure 1, is a set of nodes. Each node is a real or virtual processing element. A node can start an application, for which it is the master, and it can be asked to take part in running other applications as a slave. Any parallel algorithm to be run on the grid must first be divided into a set of tasks. The resulting tasks are launched so that they execute simultaneously on the nodes of the grid. Each task is assigned to a virtual processing element managed autonomously by a thread. For any application, each Virtual Processing Element (VPE) must receive its own data and its own program code. Each PE is linked to other PEs by a local or remote bus according to the topology of the problem.
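The thread-per-VPE idea in this paragraph can be illustrated with a minimal sketch in plain Java. It is not the platform's code: the class `VpeNodeSketch`, the method `runOnVpes` and the use of a standard thread pool to stand in for VPEs are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical sketch: one node running its tasks on thread-backed VPEs. */
class VpeNodeSketch {

    /**
     * Runs the same task code on every data block, one thread-backed "VPE"
     * per block, and returns the results in block order.
     */
    static List<Double> runOnVpes(List<double[]> blocks,
                                  Function<double[], Double> taskCode)
            throws Exception {
        // Assumption: one pool thread per block stands in for one VPE.
        ExecutorService vpes =
                Executors.newFixedThreadPool(Math.max(1, blocks.size()));
        try {
            List<Future<Double>> pending = new ArrayList<>();
            for (double[] block : blocks) {
                // Each VPE receives its own data and the task code to run.
                Callable<Double> job = () -> taskCode.apply(block);
                pending.add(vpes.submit(job));
            }
            List<Double> results = new ArrayList<>();
            for (Future<Double> f : pending) {
                results.add(f.get()); // the master waits for each VPE's result
            }
            return results;
        } finally {
            vpes.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<double[]> blocks =
                List.of(new double[]{1, 2}, new double[]{3, 4, 5});
        // Task code: sum the elements of the block.
        List<Double> sums = runOnVpes(blocks, b -> {
            double s = 0;
            for (double v : b) s += v;
            return s;
        });
        System.out.println(sums); // prints [3.0, 12.0]
    }
}
```

In the middleware itself the VPEs may live on remote nodes and are reached through the naming service; the sketch only covers the local, single-node case.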
Figure 1. Graphic representation of a computing grid with one master, 32 slaves and 7 PEs per node.
All PEs can exchange data and instructions with each other during program execution. They must be available to carry out new tasks and return the results to the master PE of the task. In order to communicate, the reference of each PE is published in the naming service under a public name chosen by the host depending on the topology of the problem. In this work, we consider the 2D mesh topology, which is well suited to distributed matrix computation, and we focus in particular on image processing problems of very large size. Such problems require significant computation time when processed sequentially.
The nodes of the grid are managed by an application server, which starts by creating the remote PEManager object. The reference of the latter is published in a naming service developed for this middleware. Each PEManager can act as the master for any parallel application launched from its node, or as a slave for applications launched from other nodes of the grid. The master PEManager must distribute the program code and data of the parallel program tasks. Each task is assigned to a PE governed by a PEManager. The virtual elementary process, which is responsible for running the code that describes the task on the elementary data, is created by a PEManager. Its life cycle is as follows:
• The PE is created by the PEManager at the request of the master.
• It loads the elementary data received from the PEManager.
• It loads the program code implementing the task to perform.
• It runs the program code on the elementary data.
• If the processing involves a neighborhood, the PE can exchange with its neighboring PEs the pieces of data needed for further processing.
• It can be activated or deactivated.
• It provides its elementary results to the host, at the host's request.
III.
VIRTUAL PROCESSING ELEMENT STRUCTURE
In this section, we translate all the components and features of a physical PE, as shown in Figure 2. Each PE is defined by its state and behavior, and must have a local or remote identification.
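The state model that this section describes (the coordinates iReg and jReg, the row-major identifier idReg = n*iReg + jReg, 16 data registers and a 16-bit flag register) might be sketched in Java as follows. The class name `PeStateSketch` and the method `ensureRegisters` are hypothetical; only the register names and default sizes come from the text.

```java
import java.util.Arrays;
import java.util.BitSet;

/** Hypothetical sketch of the PE state described in this section. */
class PeStateSketch {
    final int iReg;   // row coordinate in the n x n mesh
    final int jReg;   // column coordinate in the n x n mesh
    final int idReg;  // row-major identifier: n * iReg + jReg
    double[] reg = new double[16];        // internal data registers
    final BitSet flags = new BitSet(16);  // flag register, 16 bits by default

    PeStateSketch(int n, int i, int j) {
        if (i < 0 || i >= n || j < 0 || j >= n) {
            throw new IllegalArgumentException("PE outside the n x n mesh");
        }
        this.iReg = i;
        this.jReg = j;
        this.idReg = n * i + j;
    }

    /** Grows the register array when a problem needs more than 16 registers. */
    void ensureRegisters(int count) {
        if (count > reg.length) {
            reg = Arrays.copyOf(reg, count); // existing values are preserved
        }
    }

    public static void main(String[] args) {
        PeStateSketch pe = new PeStateSketch(4, 2, 3);
        System.out.println(pe.idReg); // prints 11 (4 * 2 + 3)
    }
}
```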
Figure 2: Representation of PE components.
The state model of a PE describes all its physical components. They are:
Identifier registers: When a created PE is inserted in the mesh model, its instance variables iReg and jReg, which represent its location coordinates in the n x n matrix, are set. The identifier register of this PE, idReg, takes the value computed in row major order by: idReg = n*iReg + jReg.
Internal registers: For any given computation problem, each PE must use some internal registers to save data and the results of any related processing. To do so, we define in the PE model an array of internal data registers named “reg [..]”. In this model, we have arbitrarily defined an array of 16 data registers. This array may be extended dynamically to any other size according to the problem at hand.
Flags: As in any standard processor, we introduce in the PE model a special flag register, in which each flag bit indicates the PE state related to a performed instruction. This register is arbitrarily defined as an array of 16 bits, but it can be extended to any larger size to accommodate additional instructions.
Communication ports: In the real RMC machine, all the PEs can exchange data through their communication ports. When the PE is asked to perform any operation, it delegates this operation to its Arithmetic and Logic Unit. The PE can communicate with other local or remote PEs of the grid using its logical ports. The cost of communication between PEs depends on the topology of the grid. It is therefore crucial to choose, for each parallel program, the topology that optimizes its complexity.
In the AbstractVirtalPE class we have implemented three kinds of operations: basic operations, data exchange operations and configuration operations.
IV.
MIDDLEWARE OBJECT MODEL
The VPEs can be distributed over different machines, with the ability to migrate from one machine to another depending on system load balancing. Each group of VPEs on a computer is controlled by a "VPEManager". The latter is responsible for:
- Managing the life cycle of its VPEs.
- Managing access to remote VPEs using the naming service.
- Taking part in the management of the load balancing problem.
The Parallel Virtual Machine (PVM) based on this middleware is closed to changes but open to extension. That is why we describe the VPE by an abstract class that declares its intrinsic features, with the ability to define multiple implementations. The programmer may therefore develop a custom implementation of the VPE without changing anything in the core of the PVM. For each new implementation of AbstractVPE, the programmer can create a VPE factory associated with this new type. The virtual machine is formed by a parallel set of VPEs stored in a dynamic data structure defined by the topology selected by the developer. In this version of the virtual machine we have implemented the Mesh2D topology, but the programmer may implement another topology. To make this possible, we defined in the core of this virtual machine:
- A class called "LocalVirtualGrid", in which we have implemented most of the operations of local data parallelization and processing in a node of the grid.
- A class called "RemoteVirtualGrid", which implements the bulk of the operations that distribute data and tasks to remote nodes.
The programmer can write parallel programs using the XML language or classic programming languages like Java or C++. To construct an XML parallel program, some sets of instructions were defined. This representation follows this conceptual scheme: a program is a set of instructions. Each instruction is defined by a name and a set of attributes, and can contain other instructions. Each attribute is defined by a name and a value.
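The conceptual scheme for XML programs (an instruction has a name, a set of attributes, and nested instructions) maps naturally onto a small tree structure. The sketch below is only an illustration of that scheme: the class `InstructionSketch` and the element names `program`, `load`, `loop` and `shift` are invented here and are not the middleware's actual instruction set.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the instruction tree behind an XML program. */
class InstructionSketch {
    final String name;
    final Map<String, String> attributes = new LinkedHashMap<>();
    final List<InstructionSketch> children = new ArrayList<>();

    InstructionSketch(String name) {
        this.name = name;
    }

    /** Adds a name/value attribute and returns this instruction. */
    InstructionSketch attr(String key, String value) {
        attributes.put(key, value);
        return this;
    }

    /** Nests another instruction inside this one and returns this one. */
    InstructionSketch add(InstructionSketch child) {
        children.add(child);
        return this;
    }

    /** Renders the tree as XML text (attribute values are not escaped). */
    String toXml() {
        StringBuilder sb = new StringBuilder("<").append(name);
        for (Map.Entry<String, String> e : attributes.entrySet()) {
            sb.append(' ').append(e.getKey())
              .append("=\"").append(e.getValue()).append('"');
        }
        if (children.isEmpty()) {
            return sb.append("/>").toString();
        }
        sb.append('>');
        for (InstructionSketch c : children) {
            sb.append(c.toXml());
        }
        return sb.append("</").append(name).append('>').toString();
    }

    public static void main(String[] args) {
        // Element and attribute names below are invented for illustration.
        InstructionSketch program = new InstructionSketch("program")
                .add(new InstructionSketch("load").attr("src", "image1"))
                .add(new InstructionSketch("loop").attr("count", "3")
                        .add(new InstructionSketch("shift").attr("dir", "east")));
        System.out.println(program.toXml());
    }
}
```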
Figure 2 shows most of the classes of the middleware object model corresponding to the preceding description.
Figure 2: UML class diagram of the middleware.
V.
APPLICATIONS
In this section, we present two applications. The first is a parallel implementation of component contour detection in a gray level image using the Sobel operator. This application is an XML implementation which shows how our parallel virtual machine can be used to parallelize sequential algorithms using the Virtual Reconfigurable Mesh Computer architecture with SIMD parallel programming. The second application is a parallel program for medical image segmentation. In this application, we show how to create a parallel and distributed application as an SPMD (Single Program Multiple Data) program on our platform using the Java language.
A.
Parallel component contour detection
The XML parallel program, with comments, is presented as follows. The resulting image is presented in Figure 3.
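As a point of reference for the operator itself, and not as the platform's XML program, the Sobel gradient magnitude can be sketched sequentially in plain Java. The class name `SobelSketch` and the choice to leave border pixels at zero are assumptions; on the mesh, each PE would compute one such pixel from the data of its neighbors instead of looping over the image.

```java
/** Hypothetical sequential sketch of the Sobel gradient magnitude. */
class SobelSketch {
    // Standard 3 x 3 Sobel kernels for horizontal and vertical gradients.
    static final int[][] GX = {{-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1}};
    static final int[][] GY = {{-1, -2, -1}, {0, 0, 0}, {1, 2, 1}};

    /**
     * Returns the gradient magnitude of a gray level image.
     * Assumption: border pixels have no full neighborhood and stay at 0.
     */
    static double[][] gradientMagnitude(double[][] img) {
        int h = img.length;
        int w = img[0].length;
        double[][] out = new double[h][w];
        for (int y = 1; y < h - 1; y++) {
            for (int x = 1; x < w - 1; x++) {
                double gx = 0;
                double gy = 0;
                for (int dy = -1; dy <= 1; dy++) {
                    for (int dx = -1; dx <= 1; dx++) {
                        double p = img[y + dy][x + dx];
                        gx += GX[dy + 1][dx + 1] * p;
                        gy += GY[dy + 1][dx + 1] * p;
                    }
                }
                out[y][x] = Math.sqrt(gx * gx + gy * gy);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // A vertical edge: dark columns on the left, bright column on the right.
        double[][] img = {{0, 0, 10}, {0, 0, 10}, {0, 0, 10}};
        System.out.println(gradientMagnitude(img)[1][1]); // prints 40.0
    }
}
```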
a- Input Image
b- Result Image
Figure 3: Result of parallel components contour detection using the Sobel operator.
B.
Parallel classification CMEAN Algorithm
Image segmentation [9] is the process of splitting an image into a set of regions, classes or homogeneous subsets according to some criteria. Gray levels, texture and shapes are the most commonly used segmentation criteria. Their choice usually depends on the kind of image and on the goals to be reached after processing. Image segmentation can be considered either as an image processing problem or as a pattern recognition problem.
B.1. Implementation of the elementary classification task
We first create a new implementation of the VPE to add the classification behavior:
package vpm.pe;
public class ClassificationVPE extends AbstractPE {
  public void classification(){
    // Retrieve the elementary image matrix
    double[][] data1=getLocalData("image1").getData();
    // Retrieve the current cluster centers
    double[] classCenters=getLocalData("classCenters").getData()[0];
    // Create the matrix of the classified image results
    double[][] classification=new double[data1.length][data1[0].length];
    /* Create an array to calculate the distances of the current pixel to the cluster centers */
    /* Create an array to calculate the sum of the distances of all pixels to each class center */
    /* Create an array to calculate the sum of all the colors of pixels of each class center */
    /* Create an array to calculate the number of pixels belonging to each class center */
    /* For each pixel */
    /* Determine the nearest class center to the pixel. */
    /* Increment the number of pixels of the membership class */
    /* Add the pixel color to the sum of the pixel colors of the membership class */
    /* Add the distance of the current pixel from the membership class to the sum of the distances of the same cluster pixels from the same membership class */
    /* Define the class membership of the current pixel */
    /* Loop */
    /* Store the three tables in a temporary result matrix that will be sent to the Master PEManager */
    double[][] res1=new double[][]{nombresPixels,sommeDistances,sommeCouleurs};
    /* Load this matrix into the VPE memory */
    loadTransientLocalData(new DataMatrixDouble(res1));
    /* Load the current segmented elementary image into the VPE memory */
    loadFinalResutData(new DataMatrixDouble(classification));
  }
}
B.2. Parallel and distributed program
The chart in Figure 4 shows the parallel and sequential steps of the parallel classification implemented on our platform.
[Figure 4 flow chart steps: Split image; Broadcasting data image over the grid; Broadcasting program code over the grid; Broadcasting class centers; Local class determination task assigned to VPE 1, VPE 2, ..., VPE n; Global class determination task assigned to Master; Test? (N: loop, Y: END)]
Figure 4: Parallel classification algorithm chart.
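The flow of Figure 4 can be tried outside the platform with a self-contained sketch. It is written under several assumptions that are not taken from the paper: hard (nearest-center) assignment, the absolute gray level difference as distance, Java parallel streams in place of grid VPEs, and the hypothetical names `CMeansSketch`, `localStep` and `cluster`. The per-pixel label matrix that the elementary task also stores is omitted.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical sketch of the local/global steps of parallel c-means. */
class CMeansSketch {

    /** Local task on one elementary image: {counts, distanceSums, colorSums}. */
    static double[][] localStep(double[][] block, double[] centers) {
        int c = centers.length;
        double[] counts = new double[c];
        double[] distSums = new double[c];
        double[] colorSums = new double[c];
        for (double[] row : block) {
            for (double pixel : row) {
                int best = 0; // index of the nearest class center
                for (int k = 1; k < c; k++) {
                    if (Math.abs(pixel - centers[k]) < Math.abs(pixel - centers[best])) {
                        best = k;
                    }
                }
                counts[best]++;
                colorSums[best] += pixel;
                distSums[best] += Math.abs(pixel - centers[best]);
            }
        }
        return new double[][]{counts, distSums, colorSums};
    }

    /** Master loop: broadcast centers, run local tasks, merge, test, repeat. */
    static double[] cluster(List<double[][]> blocks, double[] initial,
                            double eps, int maxIter) {
        double[] centers = initial.clone();
        double previousCost;
        double cost = 0;
        int iter = 0;
        do {
            previousCost = cost;
            final double[] current = centers.clone(); // "broadcast" copy
            List<double[][]> partials = blocks.parallelStream()
                    .map(b -> localStep(b, current))
                    .collect(Collectors.toList());
            double[] counts = new double[centers.length];
            double[] colors = new double[centers.length];
            cost = 0;
            for (double[][] p : partials) { // global step: sum the partials
                for (int k = 0; k < centers.length; k++) {
                    counts[k] += p[0][k];
                    cost += p[1][k];
                    colors[k] += p[2][k];
                }
            }
            for (int k = 0; k < centers.length; k++) {
                if (counts[k] > 0) {
                    centers[k] = colors[k] / counts[k]; // mean gray level
                }
            }
            iter++;
        } while (Math.abs(cost - previousCost) > eps && iter < maxIter);
        return centers;
    }

    public static void main(String[] args) {
        List<double[][]> blocks = List.of(
                new double[][]{{0, 1}, {10, 11}},
                new double[][]{{0, 10}});
        double[] centers = cluster(blocks, new double[]{2, 8}, 0.1, 100);
        System.out.println(Arrays.toString(centers)); // approximately [0.333, 10.333]
    }
}
```

Only three small arrays per elementary image travel back to the master, which is what keeps the global step cheap compared with the per-pixel work done locally.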
The Java implementation of the corresponding parallel program based on our platform, with comments, is presented as follows:
package vpm.pe;
import java.util.List;
public class ImageSegmentationParallelApp extends AbstractParallelProgram {
  @Override
  public void run() {
    // The image to be segmented
    String imageName="/data/cerveau3";
    /* Split the image into a (6 x 8) matrix of elementary images */
    List data= ImageUtil.splitDoubleDataMatrix(imageName+".jpg", 0, 6,8,"image1");
    /* Set the task to be executed by all VPEs */
    setCurrentParallelTask("classification");
    /* Initialize the cluster centers */
    double[][] centers=new double[][]{{1,3,4,5,6}};
    /* Distribute the list of data items to the grid VPEs */
    loadData(data);
    int nb=0; /* Iteration counter */
    double JA=0; /* The cost function at iteration i-1 */
    double JN=0; /* The cost function at iteration i */
    double err=0.1; /* The precision JN-JA */
    /* Start the loop */
    do{
      /* Broadcast the cluster centers to all VPEs */
      loadSingleData(new DataMatrixDouble(centers),"classCenters");
      /* Ask all VPEs to execute the classification task */
      runParallelTask();
      /* Get the transient elementary result matrices containing the parameters used to calculate the new global cluster centers of the image */
      List res=getTransientDataResult();
      double[] sommeGlobalDonnes=new double[centers[0].length];
      double[] nbGlobalPixel=new double[centers[0].length];
      double[] sommeGlobalDistances=new double[centers[0].length];
      /* Assemble the elementary results by summing */
      for(Data dd:res){
        for(int i=0;i