Department of Numerical Analysis and Computer Science TRITA-NA-P0305 • ISSN 1101-2250 • ISRN KTH/NA/P-03/05SE
Mapping of the BCPNN onto Cluster Computers
Christopher Johansson and Anders Lansner
Report from Studies of Artificial Neural Systems, (SANS)
Numerical Analysis and Computer Science (Nada) Royal Institute of Technology (KTH) S-100 44 STOCKHOLM, Sweden
Mapping of the BCPNN onto Cluster Computers
Christopher Johansson* and Anders Lansner
TRITA-NA-P0305
Abstract
We describe how complex systems of multiple BCPNN (Bayesian Confidence Propagating Neural Network) networks are modeled, implemented and run on parallel cluster computers. The BCPNN system is modeled in terms of populations and projections. Hypercolumns and spiking activity are used to get a scalable and efficient implementation. Three communication protocols are evaluated: MPICH, TCP/IP and UDP. Fully recurrent BCPNN:s with up to 6·10^4 units that allocate more than 34 GB of memory are run. The results show that UDP provides a scalable and efficient communication protocol, and that the BCPNN with hypercolumns is well suited to, and scales well on, cluster computers.
Keywords: Parallel Computations; Cluster Computer; BCPNN; Neural Network; UDP; Java.
* E-mail: [email protected]
Introduction
Parallel implementations of neural networks (NN:s) have been studied extensively, since NN:s are inherently parallel in their structure. The focus of many studies is on how the communication between the units (neurons) of the NN is implemented efficiently. The communication in a parallel implementation of a NN often becomes a bottleneck; for instance, in the Hopfield NN the communication requirement grows quadratically while the computational load grows linearly with respect to the number of units. On standard parallel computers and simple hardware implementations the communication between the nodes (processing elements, PE:s) is based on a bus architecture, which means that the communication bandwidth is constant and does not grow with the number of nodes. A naïve implementation of a standard NN will therefore suffer from a communication bottleneck.
The parallel computers and digital hardware that are used today are not ideal for emulating NN:s, which do not really require determinism, double precision computation, or reliable communication between units. Exploiting the algorithmic robustness is a key to efficient hardware. By removing some of the control mechanisms for communication, substantial gains in speed and scalability are achieved. We explore this possibility in the design of an efficient parallel implementation on cluster computers. This will become even more important in the future when transistors get smaller and the noise level of the hardware increases. The cost of sheltering applications from the increased noise level in hardware is going to rise drastically. A key technology will thus be to use applications, such as NN:s, that are somewhat immune to noise. This picture also fits into the framework of nanotechnology, where noise levels and quantum effects are pronounced. Providing noise insensitive, high performance, computational resources for these structures will be essential [1].
Parallel implementation of NN:s is an area that has been studied since the 1980's. There exists a wide range of different NN algorithms and therefore the variety of parallel implementations of NN:s is also great. As previously mentioned, a common theme in many of the implementations is that the communication constitutes a bottleneck. A lot of effort has been put into getting a scalable implementation of the back-propagation algorithm [2], which has a very integrated structure. Good overviews of parallel NN implementations are provided by Misra [3] and Nordström [4].
The goal of this work is to develop a method for implementing large NN:s on cluster computers. The NN can either be designed as a single network or as a neural system comprising several interacting sub-networks. A premise is that the NN learning algorithm is based on correlations. We also want to explore the possibility of letting the NN algorithm handle hardware errors and communication disturbances. A key issue in the implementation is to avoid getting constrained by the communication between the processors.
Levels of Parallelism
There exists a large number of different NN architectures: multilayer feedforward NN:s, recurrent NN:s, self-organizing maps (SOM:s), radial basis function (RBF) NN:s, ART NN:s, and Boltzmann machines. The connectivity in these NN:s can differ widely depending on the particular implementation. The training algorithm also affects the design of a parallel implementation.
The focus of this work is on recurrent NN:s that use an associative type of Hebbian learning [5]. Many investigations of parallel NN implementations discuss different levels of parallelism [6]: training session parallelism, training example parallelism, NN or layer parallelism, unit parallelism, and weight parallelism. The NN that we use implements incremental learning and therefore the first two levels, training session and training example parallelism, are dismissed. We also exclude the last level, weight parallelism, because of the small computational grain. The remaining levels, NN or layer parallelism and unit parallelism, are those we consider relevant here. In this work we propose a new level not listed above, parallelism over hypercolumns, which lies between the level of the unit and the NN. All four levels are listed in Table 1, together with figures on the number of computational grains for a NN with 10^4 units. Larger grains are better because faster local communication is then used. But the grain size cannot be too big, since the goal is to distribute the computational load. Using the hypercolumn as a computational grain provides a good fit to the number of nodes on modern parallel computers.

Table 1 The figures are based on a NN with 10^4 units grouped into 10^2 hypercolumns. From these figures it is apparent that the number of hypercolumn chunks fits well into the frame of today's cluster computers, which have 10^2-10^3 processors.

Parallelism     Grains
NN              1E+00
Hypercolumn     1E+02
Unit            1E+04
Weight          1E+08
Communication
The communication network in a parallel computer or in hardware is designed in 1-2 dimensions. Examples of 1D designs are the Ethernet network and the communication bus architecture. Designs that use 2D connectivity are the cellular neural network [7] and systolic array designs [8, 9]. Somewhere in between the 1D and 2D designs are the star-coupled network design of the Internet and systolic ring designs. Biological systems have a 3D design.
In a Hopfield NN that has a full connection matrix, the number of connections grows quadratically with respect to the number of units. Obviously, if the NN is large, using hardwired full connectivity is not possible [10]. This is true for 1D, 2D, and 3D communication grid designs. A partial solution to the problem is to use sparse connectivity. If a communication bus architecture or Ethernet is used, the units of the NN must share the same communication channel. Sparse connectivity only fractionally eases the communication load. To decrease the communication load further, multiplexed communication has to be used. The communication load can also be reduced by the use of heuristic knowledge about the communication in the NN and by a smart design of the NN.
A commonly used method to achieve scalable systems is to use AER (Address Event Representation) [11, 12] in the inter-unit communication protocol. AER means that the activity in the NN is represented as discrete spikes and the only information that needs to be transmitted over the network is the address, or an identification, of the spiking units.
The spiking communication is often also paired with heuristic knowledge about the communication in the NN. A commonly used method is to only allow units with a large enough change in their input to transmit their activity. With AER it is possible to run simulations with up to 10^6 units [13]. AER also enables good load balancing, which is important especially if the networks or the computational resources are heterogeneous [14]. The target platform of this study is parallel computers, but the results on spiking units and AER from hardware implementations also apply here.
In the implementation presented in this report we use multiplexed AER communication together with an appropriate modularization of the NN to achieve a low communication load. The hypercolumn design provides an effective way of reducing the communication load while retaining the functionality of the NN. Issues of dynamic load balancing are not considered, and neither is communication congestion due to other computer programs running simultaneously. Our view of the hardware is that it is used solely for the purpose of running our NN algorithm.
Parallel Computers
At the end of the 1980's and the start of the 1990's, massively parallel computers of SIMD (single instruction, multiple data) type were popular. The SIMD type of computer has a large number of simple processors, up to 65000 on the CM (Connection Machine) [15]. Later the MIMD (multiple instructions, multiple data) type of computer started to become popular. MIMD computers have fewer but more powerful processors than the SIMD type. The development of MIMD computers has gone from a centralized to a more distributed structure. Today's parallel computers have moved in the direction of clusters, with an even more distributed architecture and on the order of 1000 processors.
Implementing programs and achieving good performance on parallel computers is generally complicated [6]. Many aspects such as the memory structure, internode bandwidth, memory bandwidth, caches, processor architectures, internode communication network topology, buffers and communication protocols have to be considered. There exist several levels of parallelism: at the lowest level is node parallelism, i.e. executing multiple instructions on the same processor, and at the highest level cluster parallelism, which means that the computations are distributed onto multiple nodes.
Today's high performance processors such as Intel's Pentium 4, AMD's Athlon XP, SUN's Ultra, and Motorola's PowerPC have SIMD instructions [16]. The Pentium 4 has the MMX (multimedia extension) instructions for fixed-point arithmetic and the SSE and SSE2 (streaming SIMD extension) instructions for floating-point arithmetic. These instructions make it possible to perform the same computation on four different variables simultaneously. The corresponding instruction set used by AMD is the 3DNow! instructions. Generally, the use of these instructions requires low-level programming, which is time consuming. Therefore these instructions have only been used in computationally intensive applications such as graphics [17] and NN:s [18]. Using the parallelism in the processors will become more important in the future, especially in the light of Intel's new IA64 architecture, which gets its performance from using a high degree of parallelism.
Parallel implementations of neural networks have mostly been done on SIMD computers such as the CM [2, 3, 19].
Each unit of the NN is usually mapped onto a single processor of the SIMD computer. This is a good solution since SIMD computers feature such a large number of processors. The memory of a SIMD computer is localized to the processors, each of which has a small amount. The communication between the processors was usually based on a communication grid structure. The large number of processors meant that a small computational grain size could be used, which suited NN applications very well. Avoiding a communication bottleneck was a big issue in these designs.
When NN:s are implemented on MIMD computers things are more complicated. For instance, many MIMD computers have a distributed memory as opposed to a shared memory architecture. The distributed memory architecture means that there are constraints on what data is available where. A solution is to use an abstraction layer that hides the distributed structure of the memory, but this usually has a negative effect on performance. Another solution is to program explicitly for the distributed memory architecture with message passing, but this is more complicated than programming for shared memory architectures. Even though there is performance to gain from avoiding the shared memory abstraction layer, many NN implementations on MIMD computers still use this abstraction [20]. The reason for this is that the NN:s have lacked a suitable structure for mapping them onto the more distributed architectures of MIMD computers.
The cluster parallel computers that are popular today have a highly distributed architecture with relatively slow interconnections. Most applications that run on clusters are implemented as a single program that runs on all nodes of the cluster (SPMD – single program, multiple data). SPMD applications use message passing between the programs. The communication in cluster computers is normally based on a regular Ethernet and thus it lacks a topology such as a grid. This means that minimization of the communication load is extremely important. Achieving good load balancing so that the computation does not stall is also important, as always in designs aimed at parallel computers.
In this work we have focused on general-purpose computers and not considered hardware implementations or special purpose hardware for running NN:s [3]. There also exists hardware that is intended to accelerate the computations of the general-purpose computer [21].
Performance Measures
The efficiency of computer programs is measured in MFLOPS or GFLOPS (10^6 or 10^9 floating-point operations per second). The peak performance of the processor is known and the closer a program gets to that level of performance, the better it is. In a NN we are more interested in how many connections the computer can process per second than in the number of actual computations performed. In many NN algorithms the learning and retrieval phases are separated. Usually the learning phase is the most computationally intensive because all weights have to be processed and updated. The common measure of this performance is CUPS (connection updates per second). The retrieval phase is often more communication than computation intensive. The activity of each unit is propagated over the NN and new activity values of the units are computed. The weights only have to be multiplied with the activity, and the speed at which this is done is called CPS (connections per second). The CUPS and CPS measures do not consider the number of operations that are performed during an update; this varies a lot between different types of NN algorithms. The benefit of the CUPS and CPS measures is that they provide a common performance measure for NN implementations.
Almost all hardware implementations, and also most implementations of NN:s on parallel computers, have used CUPS and CPS as performance benchmarks [22]. In hardware implementations the meaning of MFLOPS is lost, but it is sometimes used to link the performance of the hardware to that of a general-purpose computer. We do not consider the CUPS and CPS measures to be the most important ones. Instead we focus on the size of the NN; our goal is to create as large a NN as possible.
Parallel Computations with Java
Java is an object-oriented language with excellent support for network programming. Java is highly portable and has support for a wide variety of different platforms. The execution time of applications written in Java today compared to applications written in C is generally somewhere in the range of 5-10 times longer, and in some cases only 1.5 times longer [23]. This makes Java an attractive language for many types of applications. What is lacking in Java is good support for implementations on parallel hardware.
The performance lag of Java programs is basically due to the use of an abstraction layer between the processor and the program, the Java Virtual Machine (JVM). The JVM performs array index out-of-bounds checks during execution, which reduces performance. Another performance degrading operation is excessive type casting, and performance is usually better when the variables are doubles instead of floats. On parallel computers, thread synchronization operations can generate large performance reductions due to long wait periods. The instruction sets of the latest processors that support SIMD operations are not available in or used by the JVM. There exist a number of JVM implementations [23-25] that are more efficient than the standard JVM provided by SUN.
The Java language supports parallel processes in the form of threads and it adopts a shared-memory model. This makes it possible for Java to use the parallelism on a shared-memory parallel computer. The problem is that a naïve use of threads in Java can incur large overheads, which removes any gain of parallel execution. Implementing parallel Java programs on distributed-memory computers is more intricate. One can of course hide the distributed-memory structure of the parallel computer with an abstraction layer, but that generally results in poor performance. Instead, we are more interested in solutions that acknowledge the physical structure of the computer and deal with it. There is a wide variety of proposals for how parallel computations with distributed memory can be implemented on clusters, or even over the Internet, with Java. The solutions can be divided into these categories:
• Single VM
  o Extend the Java language and rewrite the JVM in order to enhance the handling of threads.
• Multiple VM:s
  o Use native code to facilitate the communication, e.g. MPI.
  o Implement the MPI API in pure Java code.
  o Use Java applets that communicate over the Internet.
  o A dedicated program that uses the communication capabilities of Java.
There are proposed extensions to the Java language and JVM that would enable Java programs to run parallel threads more efficiently [26] on shared-memory parallel computers. There are also extensions to the JVM that allow Java programs to run on a distributed-memory parallel computer [27]. The argument against approaching parallel computing by enhancing the support and handling of threads in Java is that it usually involves changes to the language and the JVM [28].
There are basically two ways in which the MPI API can be made available in Java. The MPI API can be written in pure Java, e.g. using the available TCP/IP sockets in Java for communication [29, 30]. This lets the program retain its portability, but communication with the sockets in Java is usually much slower than that of native code implementations. The other way of incorporating the MPI API in Java is to write a set of wrapper classes for the MPI API that use the JNI (Java Native Interface) [31-33]. Using native code means that the Java program loses its portability and that dynamic libraries have to be used.
Since writing code that uses network communication is simple in Java, it is an attractive option to write programs that fulfill their own needs of communication without having to use a dedicated API. Using the network communication directly in Java has the obvious advantage that message size overheads may be reduced. But more importantly, it allows full control of how the communication is implemented. An attractive feature of network communication to use in massively parallel systems is the broadcast. The downside of writing your own code for communication is that there is a little more code in your program compared with using a dedicated API.
Including the code used for communication in the Java program means that there are no limitations on what type of processor is used to run the program on each node. It is possible to spread the program over the Internet, where a vast amount of computing power can be found. The drawback with distributing a parallel program over the Internet is of course the extremely high latencies and the very different performance of the computers. There are a number of examples of how the Internet can be used for parallel computations. Some of these focus on building an infrastructure for running parallel programs [34-36]. The task of the infrastructure is to provide a homogeneous interface to each computer used on the Internet.
Communication in Java
Two communication protocols are supported in Java: TCP/IP and UDP. TCP/IP is a point-to-point communication protocol that is widely used on the Internet. It is implemented on top of IP and provides reliable and supervised transfers of data. It does not support broadcasts. UDP is very popular in the computer game industry since it is more efficient than TCP/IP. The efficiency of UDP comes at the price of less control over the messages. Messages sent with UDP are not guaranteed to reach their destination, and neither the time of arrival nor the timing between packets is specified. UDP supports multicasts, which we use to do broadcasts. A UDP broadcast means that the task of distributing a packet sent from one node to all other nodes is relieved from the software and handed over to the hardware. This means a much shorter execution time.
When using TCP/IP in Java for NN execution it may be necessary to set the option TCP_NODELAY to true. If this option is set to false, the network protocol will try to gather small messages into a larger message before sending them. To achieve good performance it may also be necessary to adjust the buffer sizes used for sending and receiving messages.
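As an illustration of the two options discussed above, the sketch below shows how a TCP socket can be tuned with TCP_NODELAY and explicit buffer sizes, and how a UDP multicast socket can be used to send a small message to all nodes in a group. The multicast group address, port, buffer sizes and class name are arbitrary example values, not values taken from the PBCPNN implementation.

```java
import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.MulticastSocket;
import java.net.Socket;

public class CommSketch {
    // Tune a TCP connection as described in the text: disable Nagle's
    // algorithm (TCP_NODELAY) and enlarge the send/receive buffers.
    static Socket openTunedTcpSocket(String host, int port) throws Exception {
        Socket s = new Socket(host, port);
        s.setTcpNoDelay(true);            // send small messages immediately
        s.setSendBufferSize(64 * 1024);   // example buffer sizes
        s.setReceiveBufferSize(64 * 1024);
        return s;
    }

    // Send a small message to all nodes via UDP multicast.
    // The group address and port are arbitrary example values.
    static void multicastMessage(byte[] payload) throws Exception {
        InetAddress group = InetAddress.getByName("230.0.0.1");
        int port = 4446;
        try (MulticastSocket socket = new MulticastSocket(port)) {
            socket.joinGroup(group);
            DatagramPacket packet =
                new DatagramPacket(payload, payload.length, group, port);
            socket.send(packet);
            socket.leaveGroup(group);
        }
    }
}
```

A pattern of this kind, where every client sends its part of the activity to a common multicast group and receives the other clients' parts on the same group and port, is the mechanism behind the distributed activity update described later in the report.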
MPI and PVM
The standard approach to parallel programming on distributed memory computers is to use message passing. The two standards for message passing are MPI [37] and PVM [38]. Both have bindings to Fortran, C and C++.
MPI was designed with the intention of providing a standard for message passing. Before the introduction of MPI, all vendors of parallel computers provided their own API for message passing. MPI has proved very successful and it provides a communication API that can exploit the topology of a parallel computer. Vendors of parallel computers like IBM offer MPI compliant API:s for their machines. There is also a free-ware implementation, MPICH [39], which implements the MPI API on clusters or LANs. MPICH uses TCP/IP as its default communication protocol. PVM was designed with the aim of providing a message passing API that could be used on a heterogeneous network of computers. The functionality in PVM is generally simpler than in MPI and there exists no support for communication topologies.
Both the MPI and PVM standards support collective communication such as the broadcast command. The broadcast is implemented differently depending on the MPI implementation; e.g. in the standard MPICH implementation the broadcast is only an abstraction of multiple calls to the point-to-point send and receive commands. Another common way of implementing the broadcast is in the form of a tree communication routing. Usually computer vendors have implemented the MPI broadcast in the most efficient way for their computer [40, 41]. There have been proposals for a more efficient implementation of the MPICH broadcast that builds on the UDP multicast [42]. The problem for these solutions has been how the success of the transfer can be secured without losing the performance. Some implementations have naively just hoped that the messages will arrive.
All communication in MPI is synchronized, meaning that all messages in MPI must be received and cannot be dropped or ignored. This means that if a node A sends a message to a node B, node B must process the message. The application that uses MPI is thus protected from errors such as lost communication packets.
Modeling with BCPNN
This report investigates the design and implementation of a special type of NN, the Bayesian Confidence Propagating Neural Network (BCPNN). The BCPNN learning rule uses local, correlation based, Hebbian learning. The BCPNN algorithm is derived from statistical considerations and the naïve Bayesian classifier. It can be used either as a classifier or as a hetero- or autoassociative memory.
When large NN:s are mapped onto parallel computers, the usual computational grain has been each unit or the entire network, nothing in between. This makes it hard to get a well balanced mapping of multiple or large NN:s onto a parallel computer. The BCPNN has its units grouped into hypercolumns, which constitute a computational grain between the size of a single unit and a full network. We show how the hypercolumns are used to make an efficient implementation of large BCPNN:s.
A feature of NN:s is that they are resistant to noise. We investigate how this property may be used in the implementation of the BCPNN on parallel computers. The noise tolerance means that less control of the communication between the nodes of a cluster is needed. The errors in the inter-node communication are absorbed by the BCPNN.
A detailed description of the BCPNN learning rule is not given in this paper, but it can be found in a number of other papers [43-46].
Populations, Hypercolumns and Projections
The building blocks of a BCPNN are populations – groups of units – and projections – connections between two populations. The populations are the basic building blocks of an artificial nervous system. In biological terms each population can be thought of as representing a specialized cortical area or neural substructure, e.g. the dentate gyrus of the hippocampus. A population is made up of a number of units. Each unit is usually thought to correspond to a couple of hundred neurons. The units within a population are divided into one or several hypercolumns. The activity of the units within each hypercolumn is normalized. A population, and all of its units, has a number of defining properties: the gain factor used for the normalization of activity in the hypercolumns and the potential integration constant of the units.
A projection can connect one population to itself or two different populations. If the projection only connects to one population it is said to be recurrent. A population with a recurrent projection can form an attractor memory. If the projection connects two different populations it is said to be a feed-forward projection. There are no limits to how many afferent and efferent projections a population may have. The main defining parameters of a projection are the plasticity constants.
When a BCPNN system has been defined with all of its populations and projections, the system can be formulated as a set of hypercolumns. A hypercolumn is then defined by its units and the weights of the afferent projections to these units. Formulated more mathematically: in a matrix where every column corresponds to a specific unit in the BCPNN system and each element of the matrix is a connection from another unit, a hypercolumn is defined as a number of columns in that matrix. The implication of this in a parallel computer context is that the matrix can be split column-wise and the only data that has to be available globally is the activity.
It has also been shown that using a hypercolumn structure is a good activation threshold strategy for other NN:s like the Willshaw NN [47]. The hypercolumn structure somewhat reduces the information content of patterns compared to the case where a K-winner-takes-all structure is used. In the limit where the activity goes to zero the information contents of these two pattern types coincide. The hypercolumn structure also helps to reduce the number of spurious states [47].
Multiple Projections
When a population has multiple afferent projections, i.e. it gets input from several populations, there is no exact description of the BCPNN learning rule. The derivation of the learning rule was done under the assumption that it was used in a feed-forward network. Later it was recognized that the learning rule generalizes well to recurrent networks. In the basic formulation the support of a unit is intended to be a log probability. When a unit has several afferent projections connected to it, it also gets several support values. One issue is how these support values should be combined: multiplicatively or additively. Another issue is whether these support values should be normalized before they are joined. Finally one also has to consider the treatment of the bias terms, since a bias term exists in each of the support values.
The output activity of each hypercolumn is intended to correspond to a probability density function (pdf). This pdf is formed by a normalization of the support values¹ of the units forming the hypercolumn. When there are several afferent projections there are also several support values for each unit. These support values can be combined either additively or multiplicatively; either way the activity is always normalized and thus the output will always form a pdf. The qualitative difference is that multiplicative combination gives a final output activity with a sharper distribution, while addition generally gives a flatter distribution.
When there are several afferent projections to a hypercolumn and the support values are added together, the sizes of the populations from which these projections originate play a significant role. A projection from a large population generates a larger support than one from a small population. This means that a large population has a greater influence on the output than a small population. If the support values of these different projections are normalized, all projections, no matter the size of the populations they originate from, will have the same amount of influence on the output. One benefit of normalizing the support values before adding them is that different projections can have different gains, i.e. they can generate sharp or smooth pdf:s.
The support values are made up of two terms: a bias term and a weight summation term. Usually the bias term is included in each projection, which means that when the support values from several projections are combined the bias is counted several times. There are two solutions to this problem: either the bias term can be moved out from the projection and be made a property of the units, or the bias term can be divided by the number of afferent projections. But empirically we have noticed that the bias term is generally small in comparison to the weight summation term and therefore there is no need to adjust it. The most mathematically stringent, and probably the best, solution is to combine the support values multiplicatively. In order to avoid overflow in the computations, the maximum support value within each hypercolumn is subtracted from all support values before the new activity is computed.
Spiking Units
The activity in the BCPNN can be propagated in two ways. In the normal case the activity of each unit, which is a continuous value in the range [0,1], is propagated throughout the BCPNN. If the BCPNN is run on a distributed parallel architecture, propagating each unit's activity puts a high demand on the communication bandwidth between the nodes of the parallel computer. A way to reduce the communication load is to use spiking units [48], i.e. AER. Using spiking units in a BCPNN means that only the index of the active unit in a hypercolumn needs to be propagated. The spiking unit in a hypercolumn is drawn randomly according to the activity distribution of all of the units in that hypercolumn. This also simplifies the BCPNN architecture.
A variation of spiking units is to allow multiple units to spike simultaneously in a hypercolumn. In pattern association tasks multiple spikes have proved to be a bad solution. The benefit of multiple spikes is probably seen in tasks that require a graded output, as opposed to the pattern association task. A network with multiple spiking hypercolumns would then acquire a stable output faster than a single spiking hypercolumn network.
¹ The potential of the units forming the hypercolumn is normalized, and when the potential integration factor is close to zero the potential is identical to the support.
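To make the two mechanisms above concrete, the sketch below shows one way the activity of a single hypercolumn could be computed from the support values (subtracting the maximum support before exponentiating, as described above, and normalizing to a pdf), and how a spiking unit index could then be drawn according to that distribution. The gain parameter and the method names are illustrative assumptions, not the actual PBCPNN code.

```java
import java.util.Random;

public class HypercolumnSketch {
    // Turn the support values of the units in one hypercolumn into a pdf.
    // The maximum support is subtracted first to avoid overflow in exp().
    static double[] activityFromSupport(double[] support, double gain) {
        double max = Double.NEGATIVE_INFINITY;
        for (double s : support) max = Math.max(max, s);
        double[] activity = new double[support.length];
        double sum = 0.0;
        for (int i = 0; i < support.length; i++) {
            activity[i] = Math.exp(gain * (support[i] - max));
            sum += activity[i];
        }
        for (int i = 0; i < activity.length; i++) activity[i] /= sum;  // normalize to a pdf
        return activity;
    }

    // Draw the index of the spiking unit according to the activity distribution.
    static int drawSpikingUnit(double[] activity, Random rng) {
        double r = rng.nextDouble();
        double cumulative = 0.0;
        for (int i = 0; i < activity.length; i++) {
            cumulative += activity[i];
            if (r < cumulative) return i;
        }
        return activity.length - 1;  // guard against rounding errors
    }
}
```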
On the pattern association task, using single spiking hypercolumns has a positive effect on the result; it reduces the influence of the gain parameter on the result. Implementations of NN:s that use AER sometimes have event driven spiking in order to optimize performance. The BCPNN does not implement event driven spiking; instead each hypercolumn generates a spike in every step of the relaxation. Using spiking units also gives the benefit of somewhat simpler equations in the BCPNN algorithm. In particular, the logarithm of the weights can be computed in the learning phase instead of the retrieval phase.
Design of the Parallel BCPNN Software
The most convenient way of modeling a BCPNN is to use populations and projections. But these are not well suited for a transparent and scalable mapping of the BCPNN onto the computational hardware. Hypercolumns are much better suited for this task. On the other hand, the hypercolumns are not well suited for interaction with the BCPNN. Therefore the BCPNN has to be represented both by populations and projections and by hypercolumns that are easy to map onto the computational hardware. We call the software that runs the parallel BCPNN simulator PBCPNN.
When a PBCPNN program is started, the population and projection objects are allocated together with the hypercolumn objects. A hypercolumn object contains both the units and the weights. When the whole BCPNN system has been set up with all of its populations and projections, the hypercolumn objects are distributed to the clients. After the hypercolumns have been moved to the clients, the memory used by the vectors and weight matrices is allocated. The population and projection objects still reside on the server and are used to interface the BCPNN system.
Mapping of the BCPNN onto Hardware
By dividing a BCPNN of multiple populations and projections into hypercolumns, it can relatively easily be mapped onto distributed computational hardware. Each node or processor in the cluster is termed a client and runs the computations of one or more hypercolumns. The memory load is approximately proportional to the computational load of a hypercolumn. The hypercolumns are randomly mapped onto the clients so that each client gets roughly the same amount of connection weights (memory load). This method gives a good, but simple, static load balancing. The random mapping means that there is no topology in the communication between the clients.
The distributed hardware is intended to be some form of cluster with high communication latencies between the clients. The hypercolumn structure and the spiking representation of the activity mean that the bandwidth required in the cluster is relatively low. This makes it possible to have a large number of clients that use the same communication channel. The communication operations performed on this channel are mostly broadcasts, which are used to update the activity of the BCPNN.
Load balancing is an important issue for getting good performance. We have assumed that the computational power of all clients is equal, which will often not be true if a large number of clients are used. A similar effect to that of having nodes with different computational power arises if the nodes run other programs simultaneously. An effective load balancing would have to dynamically correct for these problems. Other load balancing issues not considered are that some hypercolumns may be more dynamic and active than others and thus require more computations.
All these things are issues that need to be addressed in order to achieve good load balancing. An interesting heuristic approach is to put dependent hypercolumns onto the same client or nearby clients so that there is a communication topology in the network. In this work, however, we have not implemented any dynamic load balancing; instead we have assumed that the computations are run on dedicated, homogeneous hardware.
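As a concrete illustration of the static mapping described above, the sketch below shuffles the hypercolumns randomly and then assigns each one to the client that currently holds the fewest connection weights, so that every client ends up with roughly the same memory load. The class and field names are illustrative assumptions, not the actual PBCPNN code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class StaticMapping {
    // A hypercolumn is characterized here only by its number of afferent weights.
    static class Hypercolumn {
        final int id;
        final long weightCount;
        Hypercolumn(int id, long weightCount) { this.id = id; this.weightCount = weightCount; }
    }

    // Returns, for each client, the list of hypercolumns assigned to it.
    static List<List<Hypercolumn>> mapToClients(List<Hypercolumn> hypercolumns,
                                                int numClients, Random rng) {
        List<Hypercolumn> shuffled = new ArrayList<>(hypercolumns);
        Collections.shuffle(shuffled, rng);          // random order, no topology
        List<List<Hypercolumn>> clients = new ArrayList<>();
        long[] load = new long[numClients];          // weights per client so far
        for (int c = 0; c < numClients; c++) clients.add(new ArrayList<>());
        for (Hypercolumn h : shuffled) {
            int least = 0;                           // pick the least loaded client
            for (int c = 1; c < numClients; c++) if (load[c] < load[least]) least = c;
            clients.get(least).add(h);
            load[least] += h.weightCount;
        }
        return clients;
    }
}
```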
Figure 1 A BCPNN system with two populations and three projections is mapped onto a distributed computational hardware, which has three clients (processors). Projection 1 is divided into four hypercolumns and projection 2 and 3 are divided into two hypercolumns. The size of the two populations is assumed to be equal. The hypercolumns are randomly distributed onto the clients.
In Figure 1 a BCPNN with two populations and three projections is mapped onto three clients. There is a total of six hypercolumns in the system and these are randomly mapped onto the three clients. The relation between hypercolumns and projections is shown in the figure. The number of units in the two populations is assumed to be equal, which means that all projections are of equal size.
Centralized and Distributed Communication
Most of the communication resources are used for updating the activity in the BCPNN. We investigated one centralized and one distributed implementation of the activity update process. The centralized solution was implemented with both the MPI API and the TCP/IP protocol. The distributed solution was implemented with the UDP protocol.
Both the TCP/IP and UDP protocols are used in the distributed solution. All communication involved in the activity update process is done with the UDP protocol, while communication for administration purposes is done by TCP/IP. This means that we know that the correct patterns were trained. We also used TCP/IP to transfer the initially noisy patterns to all clients before the retrieval and activity update process was initiated. All messages containing activity data require a header of 12 bytes. The header contains information on what type of message it is and from which client and hypercolumn it originates.
In the centralized solution (Figure 2) the full activity vector is sent from the server to each client. Each client then computes its new activity and resends the portion of the activity vector that it has updated.
The process of spreading and updating the activity throughout the network requires the amount of transmitted data, Tcent, given by eq. (1), where N is the total activity message length and C is the number of clients.

Tcent = C(N + N/C + 24) bytes
(1)
The total activity message length, N, depends on the size and number of hypercolumns. If continuous valued activity is used, N equals the total number of units; if spiking activity is used, N equals the total number of hypercolumns.
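As an illustration of the message format described above, the sketch below packs a spiking activity message: a 12-byte header followed by one byte per hypercolumn handled by the sending client, so that the payload length matches N as defined above. Treating the header as three 4-byte integers (message type, client id, first hypercolumn id) and storing each spiking unit index in a single byte (assuming fewer than 256 units per hypercolumn) are assumptions made for this example; the actual field layout of PBCPNN is not specified here.

```java
import java.nio.ByteBuffer;

public class ActivityMessage {
    // Pack a spiking activity message: a 12-byte header followed by one byte
    // per hypercolumn. Header layout and one-byte indices are assumed examples.
    static byte[] pack(int messageType, int clientId, int firstHypercolumnId,
                       int[] spikingUnitIndices) {
        ByteBuffer buf = ByteBuffer.allocate(12 + spikingUnitIndices.length);
        buf.putInt(messageType);
        buf.putInt(clientId);
        buf.putInt(firstHypercolumnId);
        for (int index : spikingUnitIndices) {
            buf.put((byte) index);   // index of the spiking unit in each hypercolumn
        }
        return buf.array();
    }
}
```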
Figure 2 A centralized solution of the activity update process. This solution was implemented with both the TCP/IP protocol and the MPI API. The amount of data that needs to be transmitted in each update grows with the number of clients times the size of the network.
Figure 3 A distributed solution of the activity update process. This solution was implemented with the UDP protocol. The amount of data that needs to be transmitted in each update grows only with the size of the network.
In the distributed solution with UDP (Figure 3), the role of the server in the activity update process has largely been removed. The update process is initiated by a short command sent by the server. Each client that receives this command computes its new activity and broadcasts it on the network. All clients that receive this broadcast update their own activity vectors. Messages that are delayed are sorted out and ignored. Messages that are lost are ignored and are not resent. The amount of data transmitted, Tdist, on the network during one activity update cycle is

Tdist = C(N/C + 12) bytes
(2)
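As an illustration of the difference, consider the spiking configuration used in the experiments below, with N = 180 hypercolumns and C = 60 clients: eq. (1) gives Tcent = 60(180 + 3 + 24) = 12420 bytes per update, whereas eq. (2) gives Tdist = 60(3 + 12) = 900 bytes per update.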
From eq. (1) we know that the required bandwidth for the centralized TCP/IP communication solution roughly grows as NC + N. Eq. (2) gives that the required bandwidth for the distributed UDP solution is N. This means that the bandwidth requirement of the UDP solution remains roughly constant with respect to C, while that of the TCP/IP solution scales up linearly with C.
Training and Retrieval
In the experiments a distinction is made between the training and the retrieval phase. The training phase is generally more computation intensive, while the retrieval phase is generally more communication intensive. In a more elaborate design of the BCPNN algorithm these two phases are combined. In order to reduce the number of possible sources of errors, the communication during the training phase was always done with TCP/IP. During each cycle of the training phase a pattern of activity was distributed to each client. On each client a small part of the total connection matrix was updated.
The retrieval phase starts with a synchronization of all clients. When the clients are synchronized, a noisy retrieval cue is sent to each client with TCP/IP. After the arrival of the retrieval cue the clients start to update their activity. This update can be done with either TCP/IP or UDP communication. After the update request is sent by the server, it waits for the clients to respond to the request. If the server has not received a response from all clients within a certain time-out period, a new update request is sent. The time-out is initially set high and decreases after a while. It is computed as a running average of the longest response times plus a bias. If a client fails to respond the time-out is increased. Each UDP packet has a time-stamp and packets that are late are dropped.
Hardware
The experiments were performed on three different computers, two Linux clusters (Roxette and SBC) and one IBM SP2 (Strindberg). Strindberg has around 170 processors of various types. The experiments were run on POWER2 processors. The POWER2 nodes in Strindberg are connected with a high performance network with a bandwidth > 1 GBps. Roxette is a Linux cluster with 11 nodes, where each node has a Pentium III processor at 866 MHz and 256 MB of memory. The nodes are connected with a 100 MBps Ethernet network. SBC is a Linux cluster with over 160 processors. The experiments were run on a part of the cluster that has 80 Athlon XP processors at 1400 MHz and 768 MB of memory. The 80 nodes are grouped into 4 separate groups. Within each group the nodes are connected with a 100 MBps Ethernet network. The 4 groups are connected with a 1 GBps Ethernet connection. The new part of the cluster has 112 nodes with Athlon XP processors at 2166 MHz and 1024 MB of memory.
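A minimal sketch of the adaptive time-out used by the server during retrieval, as described under Training and Retrieval above: the time-out tracks a running average of the longest observed response times plus a bias, and is increased when a client fails to respond. The smoothing factor, the bias value and the method names are illustrative assumptions, not the actual PBCPNN code.

```java
public class AdaptiveTimeout {
    private double averageWorstResponseMs;   // running average of the longest response times
    private final double smoothing = 0.2;    // assumed smoothing factor
    private final double biasMs = 20.0;      // assumed safety margin
    private final double backoffFactor = 2.0;

    public AdaptiveTimeout(double initialTimeoutMs) {
        // Start high; the time-out then decreases as responses come in.
        this.averageWorstResponseMs = initialTimeoutMs;
    }

    // Called after each update cycle with the longest response time observed.
    public void recordWorstResponse(double worstResponseMs) {
        averageWorstResponseMs = (1.0 - smoothing) * averageWorstResponseMs
                               + smoothing * worstResponseMs;
    }

    // Called when a client failed to respond within the current time-out.
    public void recordFailure() {
        averageWorstResponseMs *= backoffFactor;
    }

    public double currentTimeoutMs() {
        return averageWorstResponseMs + biasMs;
    }
}
```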
Results
Messages that are sent by UDP are not guaranteed to arrive. In the following experiments the loss rate of UDP packets was never higher than about 1%. Only the older 1400 MHz nodes were used on the SBC cluster. The results are divided into four sections. The first section investigates different types of palimpsest, auto-associative memories. The second section investigates the communication capabilities of three different communication protocols. The third section investigates the behavior of the BCPNN algorithm in a parallel framework using two of these communication protocols. The last section demonstrates the BCPNN performance on real data.
Motivating the BCPNN Algorithm
If one wants to implement a recurrent, auto-associative NN with palimpsest properties there are basically three types to choose between: Willshaw, Hopfield, and BCPNN. We have compared these three NN:s in order to characterize their efficiency in terms of storage capacity and running times. The NN:s we used had 100 units divided into 10 hypercolumns. The training set comprised 1000 random patterns and on retrieval one hypercolumn was erroneous.
The Willshaw NN has binary weights and it gets its palimpsest property from a random reset of the weights back to zero. The problem with binary weights and palimpsest memories is that there is no information about how long a memory has been in storage. This means that the most recently learnt memory has an equal chance of being removed as an earlier learnt memory, which leads to a low capacity [49]. The probability of resetting a weight was set to 0.025. The palimpsest property can be introduced in a number of ways into the Hopfield NN. We chose to use clipped weights, with the clipping constant set to 0.8. The BCPNN had its time constant, τP, set to 25. All networks were implemented as efficiently as possible.

Table 2 A comparison of three palimpsest associative memories trained with sparse patterns. The training set had 1000 patterns and the task was to correctly recall as many patterns as possible from retrieval cues with one erroneous hypercolumn. The memories were divided into 10 hypercolumns with 10 units in each, which made the total number of units 100.
                                 Willshaw   Hopfield   BCPNN
Retrieved patterns                     10         44      60
Information / Connection (bits)     0.074       0.32    0.44
Operations / Connection                 3          5       8
Information / Operations            0.025      0.065   0.055
Training time (s)                    3.79       0.25    0.17
Iterations on Retrieval              3.47       2.65    2.11
The results in Table 2 show that the BCPNN has the highest storage capacity, but the Hopfield NN has a higher ratio of information per connection and per operation. Yet, the training time is longer for the Hopfield NN due to the branch that occurs when the over- and underflow condition of the weights is tested. Clearly, the Willshaw NN can be implemented very efficiently in hardware, but its poor palimpsest properties make it a poor choice. The strongest motivation for using the BCPNN algorithm is its high performance in terms of storage capacity and, to some extent, its fast execution time.
Communication Only
These experiments were intended to compare the performance and scaling capabilities of MPICH, Java TCP/IP, and Java UDP communication on different platforms. The two important findings were that the communication latencies were high for the Java applications and that the scaling properties of UDP communication were very good.
Figure 4 shows the number of activity updates per second that were achieved with different communication protocols on a number of different platforms using 10 clients. Not surprisingly, the C-MPI implementation run on the high-speed network of the IBM SP is by far the fastest. The update speed of the C-MPI program on the IBM SP is a factor 3 higher than that of the C-MPICH program on a 100 MBps Ethernet, but no more than a factor 2 higher than that of the Java-UDP program. By far the slowest is the Java-TCP program.
Figure 4 Update speed compared between a number of different computers and implementations. 100 hypercolumns with spiking activity were simulated. This experiment was run on 10 client nodes and 1 server node.
Figure 5 shows the network bandwidth used. Again we see that the C-MPI program run on the IBM SP puts up the highest figures. The network utilization of the C-MPI program is about 25% of the theoretical maximum capacity. The C-MPICH implementations used almost 100% of the theoretical network bandwidth capacity. The Java-TCP and Java-UDP programs used only about 15% of the theoretical network bandwidth capacity. Yet the Java-UDP program was able to beat the C-MPICH program on update speed due to its more efficient use of the bandwidth (Figure 4).
Figure 5 Utilization of the communication network bandwidth. The data are based on those in Figure 4.
In Figure 6 and Figure 7 the scaling of the three communication programs is shown. The scaling and performance of the Java-UDP program stand out compared with those of the C-MPICH and Java-TCP programs. The C-MPICH and Java-TCP programs scale relatively equally; the only difference is a constant factor, as expected. Figure 7 shows that the bandwidth utilization goes down for both the C-MPICH and the Java-UDP implementations with an increasing number of clients. In the case of Java-TCP the bandwidth utilization remains constant, which is probably due to the already very high latencies involved in the Java-TCP communication.
Figure 6 Comparison of the number of updates per second on the SBC computer. The scaling experiments start with 2 clients and go up to 64 clients. 100 hypercolumns with spiking activity were simulated.
Figure 7 The results from Figure 6 reformulated to show the network bandwidth used.
Parallel BCPNN
We use the terms "UDP update" and "TCP update" to indicate where we only measured the time of the activity updates and computations, and not the initial synchronization and setup of the retrieval cue. The training phase was always done with TCP/IP communication.
In Figure 8 a BCPNN with 18000 units is run on different numbers of clients. This number of units means that it is too large to be run on a single client; approximately 2.5 GB of memory is used by the BCPNN. The results show that the UDP solution scales perfectly at least up to 60 clients. We have not used more clients than that.
Figure 8 The retrieval update rates of both TCP/IP and UDP based communication. The time used for transmission of the initial retrieval cue and the synchronization (between the retrieval of different patterns) is not included. The BCPNN has 18000 units partitioned into 180 hypercolumns.
The results in Figure 9 include both the initialization and synchronization processes of the retrieval update cycle. The scaling of the retrieval process is fairly good up to 30 clients, after which more clients only fractionally increase the update speed. The results from the training iterations are also shown and they indicate that training scales fairly well up to 30 clients.
Figure 9 The rate of updates for both the training and retrieval phase. Here the total time is presented, including initialization and synchronization of the processes. The BCPNN has 18000 units partitioned into 180 hypercolumns.
Figure 10 shows the number of processed connections (i.e. weights) per client and second. The results show that the training phase is about a factor 3 more computationally intensive than the retrieval phase. The training phase always uses TCP/IP communication and therefore it scales proportionally to the TCP based retrieval. The scaling of the UDP based retrieval is good and the connection update speed is almost maintained when the number of clients is increased. The amount of memory allocated on each client goes from about 500 MB when 5 clients are used to about 40 MB when 60 clients are used.
Table 3 shows the CPS measured on a single processor running a Java implementation of the BCPNN dedicated to a single processor. The BCPNN has 6400 units and allocates about 300 MB of memory. The performance is slightly lower than for a single processor running the parallel BCPNN. The performance difference may be due to fewer branches in the larger parallel BCPNN.

Table 3 The connection update speed of a single processor implementation of the BCPNN. The results are from running a BCPNN with 6400 units partitioned into 80 hypercolumns.
        Train      Retrieve
CPS     1.6E+07    4.3E+07
Table 4 The connection update speed of 60 processors running a BCPNN with 18000 units partitioned into 180 hypercolumns.

        Train      TCP update   UDP update
CPS     6.0E+08    2.2E+09      3.5E+09
Figure 10 The number of connections processed per client. The retrieval times do not include initialization and synchronization. The BCPNN has 18000 units partitioned into 180 hypercolumns.
The time of a single iteration in BCPNN:s of different sizes is shown in Figure 11. The experiment is run on 60 clients. The results are shown for the training, the full retrieval cycle, and only the UDP update cycle without initialization and synchronization. The gap between the UDP update and the full retrieval time shrinks as the BCPNN gets larger and the computational time in each cycle becomes longer. The sharp rise in the training time when the number of units starts to approach 7·10^4 is due to memory swapping; the memory used for the P-variables (the moving averages) is swapped out when the weights are computed. The total memory used by the BCPNN in the last run, with 67200 units, is 34 GB.
Figure 11 60 clients are used to run successively larger BCPNN:s. The UDP update phase does not include initialization and synchronization of the retrieval phase. The sharp increase of the training time when the number of units is close to 7·10^4 is due to swapping of memory.
Figure 12 shows a theoretical calculation of how large BCPNN:s can be fitted into the memory of the newest nodes in the SBC cluster. These nodes have 1024 MB of RAM, and 800 MB of RAM is assumed to be available for the program. The calculation gives that a BCPNN with 10^5 units can be run on 100 of the new nodes.
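For a rough consistency check, assume about 8 bytes of memory per connection, which is in line with the 18000-unit network using about 2.5 GB and the 67200-unit network using 34 GB. A fully recurrent BCPNN with 10^5 units then has 10^10 connections and needs on the order of 80 GB, which matches 100 clients with 800 MB of free RAM each.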
Figure 12 The maximum number of units that can be run, plotted against the number of clients required. The clients are assumed to have 800 MB of free RAM. The SBC cluster has 112 nodes with a capacity that matches this calculation.
Picture Data
A BCPNN with 16384 units partitioned into 1024 hypercolumns, with 16 units in each hypercolumn, is used to store 16 grayscale images. The color depth of the images is 16 gray levels and the size of the images is 32x32 pixels. The results indicate that about 60-70% of the pixels can be randomly flipped before the BCPNN is unable to recall the images.
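The image encoding follows directly from the network dimensions given above: each of the 32x32 pixels corresponds to one hypercolumn, and the pixel's gray level (0-15) selects which of the 16 units in that hypercolumn is active. A minimal sketch of such an encoding is given below; the method name and layout are assumptions made for illustration.

```java
public class ImageEncoding {
    // Encode a 32x32 image with 16 gray levels as one active unit per hypercolumn:
    // hypercolumn index = pixel position, active unit index = gray level (0-15).
    static int[] encode(int[][] image) {
        int[] activeUnits = new int[32 * 32];            // one entry per hypercolumn
        for (int y = 0; y < 32; y++) {
            for (int x = 0; x < 32; x++) {
                activeUnits[y * 32 + x] = image[y][x];   // gray level selects the unit
            }
        }
        return activeUnits;
    }
}
```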
With 50% noise (Figure 13) all of the 16 images are recalled correctly. With 60% noise (Figure 14) 14 images are recalled correctly, with 70% noise (Figure 15) 6 images are recalled correctly, with 80% noise (Figure 16) 1 image is correctly recalled, and with 90% noise no images are recalled correctly.
Figure 13 50% noise in the retrieval cues.
Figure 14 60% noise in the retrieval cues.
Figure 15 70% noise in the retrieval cues.
Figure 16 80% noise in the retrieval cues.
Figure 17 90% noise in the retrieval cues.
Future Work
Using a sparse connection matrix has two interpretations. The first, and most common, is that a full connection matrix is used during training, after which some of the weights are randomly removed; the connection matrix is then said to be diluted. The remaining weights are stored in an efficient way. Diluting the connection matrix is often used as a test of the error robustness of a NN, but it does not address the issue of incremental learning, and choosing an appropriate threshold function becomes very important when the connection matrix is diluted.
In our approach to using a sparse connection matrix we instead consider the case where the load on the NN is low, i.e. the number of correlations (correlations greater than zero) in the connection matrix is small. We also assume a palimpsest NN, meaning that old memories are replaced with new ones, so the load level remains low even after training a great number of patterns. The connection matrix is then represented by only a few weights that are dynamically allocated during learning. Each unit in the NN is allowed to store only a certain number of incoming connections in its unit connection memory (UCM). The UCM acts as a queue: if the UCM is full when a new memory is to be learnt, the oldest weight in the UCM is removed, and if several memories make use of the same weight, the priority of that weight may be increased. The UCM queue is thus affected both by time and by the magnitude of the correlation. In this approach the connection matrix is not diluted; instead we exploit the fact that only a few correlations are made between the units of the NN, so no changes have to be made to the threshold function, as would be needed with a diluted connection matrix. Implementing the BCPNN rule with this approach means that the actual weights are only computed when needed.
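The UCM could, for instance, be realized as a small bounded queue per unit. The sketch below only illustrates the idea described above; the class and field names are our own, and the eviction rule (prefer the oldest entries that were never reused) is one possible reading of the combined time and priority criterion.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Sketch of a unit connection memory (UCM): a bounded queue of incoming
 *  connections. When full, an old, low-priority connection is evicted, and a
 *  connection that is reused by a new memory has its priority increased. */
public class UnitConnectionMemory {
    private static class Connection {
        final int source;     // index of the presynaptic unit
        double weight;        // stored weight (or weight state)
        int priority;         // raised each time a new memory reuses this connection

        Connection(int source, double weight) {
            this.source = source;
            this.weight = weight;
        }
    }

    private final int capacity;                                 // maximum incoming connections per unit
    private final Deque<Connection> queue = new ArrayDeque<>(); // ordered oldest to newest

    public UnitConnectionMemory(int capacity) { this.capacity = capacity; }

    /** Called when a new memory correlates this unit with the given source unit. */
    public void learn(int source, double weight) {
        for (Connection c : queue) {
            if (c.source == source) {    // connection already stored:
                c.weight = weight;       // update it and
                c.priority++;            // increase its priority
                return;
            }
        }
        if (queue.size() == capacity) {  // full: evict before adding the new connection
            evictOldestLowPriority();
        }
        queue.addLast(new Connection(source, weight));
    }

    /** Removes the connection with the lowest priority, oldest first on ties. */
    private void evictOldestLowPriority() {
        Connection victim = null;
        for (Connection c : queue) {     // iteration order is oldest to newest
            if (victim == null || c.priority < victim.priority) {
                victim = c;
            }
        }
        queue.remove(victim);
    }
}
```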
Discussion
Compared to other palimpsest, auto-associative NN:s the BCPNN performs very well. It has a large storage capacity and also supports a relatively efficient implementation. Furthermore, it is robust to noise and somewhat insensitive to errors in the activity communication.
The distributed UDP communication solution for broadcasting the activity gives an advantage over both the MPICH and TCP implementations of the broadcast. This advantage is achieved even though the communication latencies are higher for UDP-Java than for MPICH-C, and it appears once more than about 5 clients are used. With UDP communication the scaling is linear up to at least 60 clients. The small errors that were incurred by the UDP communication did not affect the BCPNN algorithm, as the experiments with the noisy images show.
The hypercolumn concept proved to be a key technology in reducing the communication load. The communication was not a bottleneck, which was demonstrated by the CPS performance of each separate processor: it was not degraded compared to a single-processor implementation of the BCPNN. If smaller BCPNN:s were run the communication could become a bottleneck, but for the sizes of BCPNN:s we used the execution was computation bound.
Large, fully connected BCPNN:s with up to 6·10^4 units were implemented. Theoretical computations show that BCPNN:s with up to 10^5 units can be implemented on the SBC computer. An extrapolation of the results on the iteration times suggests that a BCPNN with 10^5 units may have an iteration time of less than 60 s. The maximal size of a BCPNN, if we intend to have iteration times of less than 100 ms, is about 10^4 to 2·10^4 units.
An interesting subject is to implement a BCPNN with a sparse weight matrix. A sparse weight matrix implementation reduces the memory requirements of the BCPNN, which means that larger BCPNN:s can be created. Since the memory load on a large BCPNN generally is far below the limit of its capacity, a sparse weight matrix implementation does not mean that the weight matrix has to be diluted. It would be interesting to study the performance of a BCPNN with a connectivity of less than 10%.
A not surprising result was that synchronizations and TCP/IP communication dragged down the performance of the UDP communication. This can be solved with a more sophisticated implementation of the BCPNN that does not separate the training phase from the retrieval phase. For large enough BCPNN:s the choice of communication protocol does not matter: if the computational times are more than a second, TCP/IP based communication fulfills the needs of activity update speed. But if we want iteration times of less than 100 ms, a more efficient communication protocol such as UDP has to be used.
A problem with using Java is the performance. The latencies in the communication are high; TCP/IP communication with Java is almost a factor of 10 slower than with a C implementation such as MPICH. The computational performance of Java is also poor compared to C, which is almost a factor of 10 faster in that respect as well. A further restriction with Java is that the SIMD instructions of modern processors cannot be used. In the introduction to parallel computers we proposed that multilevel parallelism is important, and with Java we can only use one level of parallelism.
The communication load in the parallel BCPNN can be reduced even more with event-driven spiking, which means that a unit only spikes, or sends its activity, if its support has changed. This can give substantial gains in a large NN where only a few units change their activity.
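Event-driven spiking could, for example, be implemented by letting each unit remember the support value it last broadcast and only send a new activity message when the support has changed by more than some small threshold. The sketch below is only an illustration of that idea; the names and the threshold are our own assumptions.

```java
/** Sketch of event-driven activity updates: a unit broadcasts its activity only
 *  when its support has changed noticeably since the last message it sent.
 *  The change threshold is an assumed parameter, not taken from the report. */
public class EventDrivenUnit {
    private double support;            // current support value of the unit
    private double lastSentSupport;    // support value at the time of the last broadcast
    private final double threshold;    // minimum change that triggers a new message

    public EventDrivenUnit(double threshold) {
        this.threshold = threshold;
    }

    /** Update the support and report whether the new activity should be broadcast. */
    public boolean update(double newSupport) {
        support = newSupport;
        if (Math.abs(support - lastSentSupport) > threshold) {
            lastSentSupport = support;
            return true;               // support changed: send a spike/activity message
        }
        return false;                  // support unchanged: stay silent, saving bandwidth
    }
}
```

In a large network where only a few units change their activity between iterations, such a scheme would cut the number of messages roughly in proportion to the fraction of units that stay silent.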