Proving Properties of Neural Networks with Graph Transformations
I. Fischer, University of Erlangen-Nuremberg, Germany, [email protected]
M. Koch, Technical University of Berlin, Germany, [email protected]
M. R. Berthold, University of California, Berkeley, CA, USA, [email protected]
Appeared in Proceedings of the IEEE International Joint Conference on Neural Networks, Anchorage (Alaska), May 1998.
Abstract: Graph transformations offer a unifying framework to formalize Neural Networks together with their corresponding training algorithms. Even topology-changing training algorithms can be described straightforwardly with the help of these transformations. One of the benefits of this formal framework is its support for proving properties of the training algorithms. A training algorithm for Probabilistic Neural Networks is used as an example: its termination and correctness are proven on the basis of the corresponding graph rewriting rules.
Keywords: Graph Transformations, Formalization, Probabilistic Neural Network
I. Introduction
In [2] graph transformations were proposed as a unifying formalism for the representation of the structure of Neural Networks. The framework can easily be used to formalize the algorithms required to train and use the network. Graph transformations additionally provide a visual way to construct the network and therefore enable rapid prototyping. Here, an example of a training algorithm for Probabilistic Neural Networks [1] is formalized with the help of graph transformations. The training patterns are also represented graphically, and all control mechanisms necessary for training are encoded in the graphs used. Thus the graph transformation rules can be seen as a new paradigm for rapid prototyping of Neural Nets. Furthermore, ways to formalize proofs about the network's behaviour and about properties of the training algorithm are provided on the basis of this formalization. This demonstrates that graph transformations are not only useful for specifying algorithms but also offer a solid theoretical background.

II. Transforming Graphs
In the following formalization the Probabilistic Neural Network [4] and the patterns are modeled with the help of a labeled graph, whereas the training algorithm is modeled by graph rewriting rules. This methodology was proposed in [2]. Additional information on graph transformations can be found in [3].

A labeled graph G consists of a set of edges G_E, a set of nodes (vertices) G_V and two mappings s_G, t_G : G_E → G_V that specify the source and target node of each edge. Additionally, nodes and edges can be labeled several times with elements of different label sets L_{E_i}, L_{V_j} (0 ≤ i ≤ n, 0 ≤ j ≤ m). This is done with the help of mappings l_{G_{E_i}} : G_E → L_{E_i} and l_{G_{V_j}} : G_V → L_{V_j}.

A morphism (f_E, f_V, f_{L_{E_i}}, f_{L_{V_j}}) between two graphs G and H labeled with the same alphabets consists of a mapping f_E between the edges, f_V between the nodes and f_{L_k} (k in {E_i, V_j}) between the label sets of the graphs. A morphism must be compatible with the mappings used in the graph definition, e.g. s_H ∘ f_E = f_V ∘ s_G and f_{L_{E_i}} ∘ l_{G_{E_i}} = l_{H_{E_i}} ∘ f_E. The same must hold for the labeling of the nodes and for the target mappings of the graphs.

A graph can be modified by graph rewriting rules. A graph rewriting rule consists of two morphisms l : K → L and r : K → R, a set of morphisms A_G = {a_i : L → A_i | 1 ≤ i ≤ n}, called graphical conditions, and a set A_L of boolean expressions, called label conditions. In the upper half of Figure 1 a graph rewriting rule (l : K → L, r : K → R) with one graphical condition (a : L → A) and one label condition (x < y) is shown. Here nodes are labeled with elements of the term algebra over the integers, whereas edges are not labeled. The terms x, y, z are variables. In K and C we use ? as an extra element of our label set to indicate that labels have to be changed.
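As a purely illustrative aside (not part of the paper's formalization), such a labeled graph can be represented by a very small data structure. The following Python sketch assumes a single node label set and a single edge label set; the names LabeledGraph, add_node and add_edge are hypothetical choices for this example.

    # A minimal sketch of a labeled graph with explicit source/target mappings,
    # assuming one node label set and one edge label set for simplicity.
    class LabeledGraph:
        def __init__(self):
            self.nodes = {}     # node id -> label (e.g. a number or a control symbol)
            self.edges = {}     # edge id -> (source node id, target node id, label)

        def add_node(self, node_id, label):
            self.nodes[node_id] = label

        def add_edge(self, edge_id, source, target, label=None):
            # the mappings s_G and t_G are realized by storing source and target
            # together with each edge
            assert source in self.nodes and target in self.nodes
            self.edges[edge_id] = (source, target, label)

    # Example: a two-node graph with one edge labeled "i"
    g = LabeledGraph()
    g.add_node("u", 1)
    g.add_node("v", 3)
    g.add_edge("e1", "u", "v", label="i")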
Fig. 1. An example of a rewriting rule
When applying a rewriting rule to a graph G, first an embedding g of the left hand side L into G must be sought. Through the embedding the variables are substituted by integers; in our example the variables x, y, z are replaced with the values 1, 3, 2. Next the label conditions have to be checked. The embedding satisfies a label condition if its evaluation is true under the variable assignment of g. In Figure 1, x < y denotes that the value substituted for x must be smaller than the value substituted for y, which is true in our example. Then the graphical conditions a_i have to be checked. The embedding g satisfies a_i if there is no embedding m : A_i → G such that m ∘ a_i = g, i.e. a_i specifies forbidden graphical structures. In Figure 1 there is a graphical condition forbidding an edge between the node labeled x and the
node labeled y. In our example the embedding satisfies the graphical condition. In the following these negative graphical conditions will be written with ¬∃ if the morphisms a_i are obvious from the graphs.

If all label conditions as well as all graphical conditions are satisfied by g, the rewriting rule can be applied. Otherwise a new embedding must be sought, if possible.

We apply the rule to G by deleting the left hand side L from G, except for K, which contains the nodes and edges that are necessary to insert R. The result is the context graph C. Formally, C is constructed as G - g(L - l(K)). Then R can be inserted into C according to the gluing graph K; the result H is calculated as C + (R - r(K)). The new labels of graph H are obtained by evaluating the expressions given in R under the variable assignment given by the embedding g. In Figure 1 the nodes in the result graph H are labeled with the results of the addition, the product and the power, respectively, under the assignment x = 1, y = 3, z = 2.

It is also useful to have graph parts where not only one embedding is taken to apply the rule, but all possible embeddings. In this case overlapping embeddings must not relabel, to prevent different new labels for a node or an edge under different embeddings. In the following, graph parts of this kind are marked with an extra circle in the rule's left hand side.

III. Dynamic Decay Adjustment

To quickly train Probabilistic Neural Networks, the Dynamic Decay Adjustment (DDA) algorithm was introduced in [1]. This constructive training methodology introduces new neurons in the hidden layer whenever required to correctly classify a new pattern, and adjusts the standard deviations of existing neurons to minimize the risk of misclassification. The code to perform training for one epoch is shown in Figure 2, where p_i^k denotes neuron i of class k (1 ≤ i ≤ m_k, where m_k specifies the number of neurons of class k), w_i^k the corresponding weight (which models a local, unnormalized a priori probability of class k), r_i^k the center vector and σ_i^k the individual standard deviation. For more details on Probabilistic Neural Networks see [4].

    // reset weights:
    (1) FORALL hidden neurons (class k, #i) DO
          w_i^k = 0.0
        ENDFOR
    // train one complete epoch
    (2) FORALL training patterns (x, k) DO
    (3)   IF ∃ p_i^k : p_i^k(x) ≥ θ^+ ∧
             ∀ 1 ≤ j ≤ m_k, j ≠ i : p_j^k(x) ≤ p_i^k(x) THEN
    (4)     w_i^k += 1
          ELSE
    (5)     // "commit": introduce new neuron
            m_k += 1
    (6)     r_{m_k}^k = x
    (7)     w_{m_k}^k = 1
    (8)     σ_{m_k}^k = σ_max
          ENDIF
          // "shrink": adjust conflicting neurons
    (9)   FORALL l ≠ k, 1 ≤ j ≤ m_l DO
            IF p_j^l(x) > θ^- THEN
              σ_j^l = sqrt( (σ_j^l)^2 · ln p_j^l(x) / ln θ^- )
          ENDFOR
        ENDFOR

Fig. 2. The DDA algorithm for one epoch

The algorithm operates as follows:
- before training an epoch, all weights w_i^k are set to zero to avoid accumulation of duplicate information about the training patterns (1);
- next, all training patterns are presented to the network (2); if the new pattern is classified correctly (3), the weight of the neuron with the highest activation is increased (4), otherwise a new hidden neuron is introduced (5), having the new pattern as its center (6), a weight equal to 1 (7), and an initial radius of σ_max (8);
- the last step shrinks all neurons of conflicting classes whose activations are too high for this specific pattern (9); that is, the standard deviation σ_j^l of a neuron of a conflicting class l is reduced if its activation is greater than θ^-.

After only a few epochs (for practical applications, approximately five) the network architecture settles (no new commitment or shrinking), clearly indicating the completion of the training phase.
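To complement the pseudocode in Figure 2, the following Python sketch implements one DDA epoch, assuming the Gaussian activation used in the proofs below, p(x) = exp(-||x - r||^2 / σ^2). It is an illustration only, not the graph-rewriting formalization developed in this paper; the names (Neuron, dda_epoch) and the threshold defaults are assumptions of this sketch.

    import math

    # Minimal sketch of one DDA epoch; all names and default values are illustrative.
    class Neuron:
        def __init__(self, center, sigma, weight=1.0):
            self.center = list(center)
            self.sigma = sigma
            self.weight = weight

        def activation(self, x):
            dist2 = sum((a - b) ** 2 for a, b in zip(x, self.center))
            return math.exp(-dist2 / self.sigma ** 2)

    def dda_epoch(neurons, patterns, theta_plus=0.4, theta_minus=0.1, sigma_max=1.0):
        """neurons: dict class -> list of Neuron; patterns: list of (x, class)."""
        # (1) reset weights
        for ns in neurons.values():
            for n in ns:
                n.weight = 0.0
        # (2) present every training pattern once
        for x, k in patterns:
            own = neurons.setdefault(k, [])
            acts = [n.activation(x) for n in own]
            if acts and max(acts) >= theta_plus:
                # (3)-(4) cover: reward the best matching neuron of the correct class
                own[max(range(len(own)), key=lambda i: acts[i])].weight += 1.0
            else:
                # (5)-(8) commit: introduce a new neuron centered on the pattern
                own.append(Neuron(x, sigma_max, weight=1.0))
            # (9) shrink conflicting neurons of all other classes
            for l, ns in neurons.items():
                if l == k:
                    continue
                for n in ns:
                    p = n.activation(x)
                    if p > theta_minus:
                        n.sigma = math.sqrt(n.sigma ** 2 * math.log(p) / math.log(theta_minus))
        return neurons

Repeated calls of dda_epoch on the same training data settle after a few epochs (no new commit, no shrink), mirroring the behaviour described above.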
In the next section, the Probabilistic Neural Network and the set of training patterns are described as labeled graphs and the entire DDA algorithm is given with the help of graph rewriting rules. Afterwards, the established formalism is used to formally prove termination of the algorithm for a finite training set.

IV. DDA as Graph Rewriting Rules

First a Probabilistic Neural Network [4] and the set of training patterns are modeled using labeled graphs. An example graph is shown in Figure 3.

Fig. 3. A snapshot graph of the neural net and the training patterns

This graph consists of one connected graph modeling the actual net (on the right hand side) and of a set of further connected graphs, one for each training pattern (on the left hand side). For the net itself the input and output neurons are modeled with nodes connected via edges labeled i (input) and o (output). Each node is labeled with a real number for the actual value of the neuron and a symbol (in the picture ?) to handle the information flow through the net. The neurons in the hidden layer are split into two nodes: the first node contains the input activation and the second one the output activation of the neuron. These nodes are labeled with control symbols (?) and real numbers as well (in Figure 3 not all control symbols are included). The two nodes of a hidden neuron are connected by an edge labeled with the individual standard deviation σ_i^k (a real number). The edges from the input layer to the hidden layer are each labeled with one component of the center vector r_i^k (also real numbers). An edge from a neuron in the hidden layer to an output neuron is labeled with the corresponding weight w_i^k.

One training pattern t = (i, o) with input vector i and output vector o is modeled by a connected graph. For each pattern a condition node contains information about the actual state of training in this epoch. For instance, in the snapshot in Figure 3 the algorithm is currently in epoch 5 because the e-node has label 5. The condition node of the uppermost pattern carries the label cv, indicating that this pattern has triggered a cover in this epoch; the label 5 indicates that this pattern was already used for training in epoch 5. Since the other two training patterns carry the label 4, they were not yet used for training in epoch 5; this is also indicated by the label nt for "not trained". The components of the input and output vectors are distributed over several nodes, where the input and target (output) nodes are connected through edges. Additionally there are edges from the condition node to the input and target nodes, labeled with i and t, to identify the input and the target.

In Figure 3 there are also two single nodes which are used to store global status information. The first is labeled f and contains information about the state of the net: f signals that the net is free at the moment, i.e. a new training pattern can be passed through the net. The second node is labeled with e and a positive integer; it indicates the number of epochs already trained and is needed later for the proof of termination and convergence.

For the sake of simplicity, the rules for the DDA algorithm are constructed for the special case where the input pattern consists of three and the target pattern of two values; that is, only 3-dimensional input patterns belonging to two different classes are considered. It is, however, straightforward to adapt the rules to arbitrary training patterns as shown in [2]. In the following subsections the entire training algorithm is described using graph transformations.
A. One training step

In the following rules we depict only the graphs L and R and omit the gluing graph K, which can easily be determined from the context.

First, one training step for one input-output pattern is described. Figure 4 illustrates how one training pattern that was not yet used in the actual epoch (upper left corner) is inserted into the net (upper right corner). The pattern itself is deleted and the corresponding values are copied to the input and output neurons of the net (bottom). The condition node f is set to t, indicating that training is in process. The control symbols of the input nodes are now C (connection) and the control symbols of the output nodes T (target). The nodes in the rules are labeled with variables and terms over these variables. When applying such a rule to a net, the variables are substituted by the real numbers of the net and the calculations are carried out with these numbers.

Fig. 4. Inserting a pattern into the net

Now the input pattern has to be propagated to the hidden layer (see Figure 5). The control symbol of the input node in the hidden layer switches to N, indicating that this neuron is now activated.

Fig. 5. Calculating the input to the hidden layer

In the rule shown in Figure 6 (in the following we will use the abbreviation "rule 6") the output activation of a neuron in the hidden layer is calculated and the control symbol of the output node of the hidden neuron is set to C.

Fig. 6. Calculating the activation of a neuron in the hidden layer

Finally the actual training can start. First either a commit (rule 8) or a cover (rule 7) is applied. The cover rule is executed if the activation of at least one neuron in the hidden layer is sufficient; that is, its activation is greater than or equal to θ^+ (see also (3) in Figure 2). To indicate that a cover was executed, the control node of the net is changed from t to cv (cover). The application conditions ensure that the hidden neuron with the biggest activation is taken and that all activations have already been computed (no ? exists). After the application of the rule, the corresponding node is labeled with D (done) to indicate that the neuron was used in this training epoch.

Fig. 7. Covering of a training set

If it is not possible to apply the cover rule, a new hidden neuron (consisting of two nodes in the graph) must be inserted as shown in rule 8 (this step corresponds to (5)-(8) in Figure 2). The application conditions ensure that rule 7 is not applicable (top) and that the computation phase of the training is already finished, that is, no ?-symbol exists (bottom). As in rule 7 the corresponding hidden neuron is marked with D, and the control node is now labeled cm (commit).

Fig. 8. Inserting a new neuron

Finally the deviations of conflicting neurons (d > θ^-) must be shrunk (rule 9, corresponding to (9) in Figure 2). This cannot be done before either the commit or the cover rule has been applied (marker cm or cv). After the application of rule 9 this control node is labeled s and the corresponding node of the neuron is marked D.

Fig. 9. Shrinking the radius of a hidden neuron

For the hidden neurons to which none of the rules 7-9 was applied, the control symbol D has to be inserted (two rules in Figure 10).

Fig. 10. Handling the rest of the hidden neurons

When all hidden neurons are labeled with D, training is finished and pattern extraction and net cleaning can start (rule 11). In this rule the pattern is extracted from the net. The training pattern is again moved into a graph of its own, where the condition node now indicates the event which happened during this training phase. The number indicating the epoch is set to the actual number given by the node labeled e, so this training pattern cannot be used again for training during the current epoch. The condition node of the net indicates via f that the net is free again. Note that nodes with double circles indicate that all possible embeddings for these nodes have to be taken instead of just one, so all D and N labels in the entire graph are changed back to ? in one step. Afterwards the next training pattern can be inserted into the net using rule 4.

Fig. 11. Extracting the result and cleaning the net

B. Training over several epochs

When no training pattern with a condition node labeled nt can be found, all patterns were actually used for training, indicating the end of this epoch. Depending on the information nodes of the patterns the algorithm terminates: when there are only patterns with a condition node cv, training is over, since the network did not change anymore during the last epoch (besides simple cover operations); then no rule can be applied anymore. Otherwise the net and the training patterns are cleaned up again and a new epoch can start. This is done in rule 12. First all training pattern condition nodes are reset to nt (not trained). Then all weights w are set to 0. Finally the number of epochs is increased by 1.

Fig. 12. Starting a new epoch
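The rule order just described can also be read as a small control program over the control symbols. The following Python sketch only mirrors this bookkeeping for one training step (rules 4-11); it does not perform the actual graph rewriting, and all names (train_pattern, the strings used for control symbols and events) are assumptions of this illustration.

    # Schematic control flow of one training step; the numeric computations of
    # Figures 5 and 6 are abstracted into the given output activations.
    def train_pattern(hidden_activations, theta_plus):
        """hidden_activations: dict hidden neuron id -> output activation d."""
        controls = {h: "C" for h in hidden_activations}   # rules 5 and 6 done: ? -> N -> C
        if hidden_activations and max(hidden_activations.values()) >= theta_plus:
            best = max(hidden_activations, key=hidden_activations.get)
            controls[best] = "D"                          # rule 7: cover
            event = "cv"
        else:
            controls["new_neuron"] = "D"                  # rule 8: commit a new hidden neuron
            event = "cm"
        # rules 9 and 10 would mark the remaining hidden neurons with D here
        for h in controls:
            if controls[h] == "C":
                controls[h] = "D"
        # rule 11: extraction resets all control symbols to "?" and frees the net
        return event, {h: "?" for h in controls}

    event, controls = train_pattern({"h1": 0.55, "h2": 0.20}, theta_plus=0.4)
    # event == "cv": the pattern was covered by neuron h1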
V. Proving Termination and Convergence

In the following, the termination of the graph transformation system described in section IV is shown. First an estimation of the maximal number of epochs is derived, and afterwards the termination of one epoch is proven.

A. Correctness and Number of Epochs

The set of all training patterns t = (i, o) is denoted by T. After every epoch i the set T is split into three disjoint subsets T = T_f^i ∪ T_p^i ∪ T_s^i. After applying rules 4-6 (i.e. computing the activation of each neuron) for a pattern t, the pattern can be classified as follows:
- Rule 8 is applicable; that is, pattern t is not correctly classified. Then t belongs to T_f^i.
- Rule 7 is applicable and d = 1; that is, pattern t is correctly classified and is the center of an existing neuron (t ∈ T_p^i).
- Rule 7 is applicable with d < 1; that is, pattern t is correctly classified but is not the center of a neuron (t ∈ T_s^i).

Definition V.1 (Correctness criterion) The set T of training patterns is classified correctly iff there is an epoch i ∈ N with |T_f^j| = 0 for all j ≥ i, j ∈ N.

Proposition V.2 (Monotony of shrinking) The expansion of a hidden neuron is monotonically decreasing.

Proof: The only rule altering the expansion of a neuron is rule 9, and it is only applied if d > θ^-. Let σ be the deviation before shrinking with respect to a training pattern t. Then the value after shrinking is

    σ' = sqrt( σ^2 · ln d / ln θ^- ) = σ · sqrt( ln d / ln θ^- ).

To prove σ' < σ (for σ, σ' > 0) it is sufficient to show that

    sqrt( ln d / ln θ^- ) < 1,

i.e. (using θ^- ∈ (0, 1) and thus ln θ^- < 0)

    ln d / ln θ^- < 1  ⟺  ln d > ln θ^-  ⟺  d > θ^-,

which was the assumption for the shrink to be executed. □
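A quick numerical sanity check of this inequality (not part of the paper, and using arbitrarily chosen values for σ, d and θ^-) can be written as follows.

    import math

    def shrink(sigma, d, theta_minus):
        # new deviation after rule 9, as derived above
        return math.sqrt(sigma ** 2 * math.log(d) / math.log(theta_minus))

    theta_minus = 0.1
    for sigma, d in [(1.0, 0.5), (0.8, 0.2), (2.0, 0.11)]:
        assert d > theta_minus
        assert shrink(sigma, d, theta_minus) < sigma   # sigma strictly decreases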
Corollary V.3
1. The deviation of a hidden neuron will be shrunk in at most one epoch for each training pattern.
2. If a neuron is not shrunk for a training pattern t in epoch i, it will never be shrunk for t in later epochs j > i.

Proof:
1. Let t ∈ T lead to a shrink of the neuron in epoch i, resulting in a new deviation σ_i. For this value

    e^( -y(t)^2 / σ_i^2 ) = θ^-

holds (y(t) indicates the activation of the hidden neuron generated by input t). Then for every later epoch j (j > i) the following is also valid (using prop. V.2, i.e. σ_j ≤ σ_i):

    d_j = e^( -y(t)^2 / σ_j^2 ) ≤ e^( -y(t)^2 / σ_i^2 ) = θ^-

and this neuron will not be shrunk again for t.
2. If a neuron with deviation σ_i is not shrunk for t in epoch i, it holds

    d_i = e^( -y(t)^2 / σ_i^2 ) ≤ θ^-.

The claim follows analogously to 1. □

Proposition V.4 For all t ∈ T and i ∈ N: t ∈ T_p^i implies t ∈ T_p^j for all j ∈ N with j ≥ i.

Proof: Let t ∈ T_p^i. Due to the grammar there is a hidden neuron whose incoming edges from the input layer are labeled with the components of the input vector of t as its center, so that y(t) = 0 holds after application of rule 5. Due to rule 6 the output value of the neuron is d = e^( -y(t)^2 / σ^2 ) = e^0 = 1. Therefore, t ∈ T_p^j holds for all epochs j ≥ i. □

Corollary V.5 For all i ∈ N: |T_p^{i+1}| ≥ |T_p^i|.

Proof: Due to prop. V.4, |T_p^{i+1}| < |T_p^i| is not possible. □

Proposition V.6 (Characterization of correctness) For all i ∈ N:

    |T_p^i| = |T_p^{i+1}| = |T_p^{i+2}|  ⟺  |T_f^j| = 0 for all j ≥ i.

Proof: ⟹: |T_f^i| = 0 holds due to |T_p^i| = |T_p^{i+1}|, i.e. all patterns are covered in epoch i. Otherwise there would be a t ∈ T_f^i which would have to be committed in epoch i+1 by means of the commit rule, resulting in |T_p^{i+1}| > |T_p^i|. Due to |T_p^{i+1}| = |T_p^{i+2}|, |T_f^{i+1}| = 0 holds for the same reasons. Due to corollary V.3 the training patterns that triggered shrinks in epoch i and i+1 do not lead to shrinks in later epochs j > i+1. Also, patterns which did not trigger a shrink in epoch i or i+1 will not trigger a shrink in later epochs. Therefore, |T_f^j| = 0 holds for all j ≥ i.
⟸: Since training patterns belonging to T_f^j are only committed and T_f^j is empty for epochs j ≥ i, the set T_p^j stays constant. □
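The key step in the proof of proposition V.4, namely that a pattern sitting exactly on a neuron's center always produces the maximal activation d = 1 no matter how far the radius has already been shrunk, can be checked numerically. The Gaussian activation and the values chosen below are assumptions of this small illustration.

    import math

    def activation(y, sigma):
        # output value of a hidden neuron, d = exp(-y^2 / sigma^2)
        return math.exp(-(y ** 2) / sigma ** 2)

    theta_plus = 0.4                            # illustrative threshold
    for sigma in [1.0, 0.5, 0.05, 0.001]:       # ever smaller radii after shrinking
        d = activation(0.0, sigma)              # y(t) = 0 at the neuron's center
        assert d == 1.0 and d >= theta_plus     # the pattern stays covered, t stays in T_p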
Theorem V.7 (Correctness of DDA) Let T be a finite training set with |T| = m. Then T is classified correctly after at most 2m - 1 epochs.

Proof: In the worst case every training pattern is covered by a neuron of its own; that is, after termination T = T_p^i. According to proposition V.6 and corollary V.5 the worst course of training then looks as follows:

    0 = |T_p^0| < |T_p^1| = |T_p^2| < |T_p^3| = |T_p^4| < ... < |T_p^{2m-1}|.

Therefore, after 2m - 1 epochs all training patterns are committed and |T_f^{2m-1}| = 0 holds. According to corollary V.5, |T_f^j| = 0 holds for all j ≥ 2m - 1. □

This means that the integer the node e is labeled with can reach at most 2m - 1.
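The counting argument behind the 2m - 1 bound can be made concrete with a small, purely illustrative calculation; the worst-case sequence is taken directly from the proof, and the helper name worst_case_sizes is hypothetical.

    def worst_case_sizes(m):
        # |T_p^0|, ..., |T_p^{2m-1}| in the worst case: the size increases by one
        # in every odd epoch and stays constant in the following even epoch.
        sizes = [0]
        for epoch in range(1, 2 * m):
            sizes.append(sizes[-1] + 1 if epoch % 2 == 1 else sizes[-1])
        return sizes

    # For m = 3 training patterns: [0, 1, 1, 2, 2, 3]; all m patterns are covered
    # by a neuron of their own after 2*m - 1 = 5 epochs.
    assert worst_case_sizes(3) == [0, 1, 1, 2, 2, 3]
    assert worst_case_sizes(3)[-1] == 3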
B. Termination of the Training

With the help of the graph modeling the net and the patterns, and of the rules transforming this graph, it is easy to show that one training step, one epoch, and the entire training terminate. First a well-founded order has to be defined on the graphs. We use the lexicographic order over tuples

    (|e|, |nt|, |?|, |C|, |N|, |D|, |T|),

where:
- |e| is a natural number denoting the number of epochs that can still be done. The maximum is 2m - 1. Each time a new epoch starts this number is decreased (|e| = (2m - 1) - label of node e).
- |nt| is the number of condition nodes of pattern sets not yet trained with (nodes labeled nt, plus the one set that is currently being trained with).
- |?|, |C|, |N|, |D|, |T| are the numbers of nodes labeled ?, C, N, D and T, respectively.

As order on these tuples we use < with 0 as the lower bound of each component; this means that (0, 0, 0, 0, 0, 0, 0) is the lower bound of the complete lexicographic order. The order is initialized with (2m, |T|, i + o, 0, 0, 0, 0), where |T| is the number of training patterns, i the number of input nodes and o the number of output nodes. If all rules follow this order, that is, the right hand side R is smaller than the left hand side L and this also holds for the graph the rule is applied to, then the graph rewriting system terminates, as the order has a lower bound. The following table shows how the rewriting rules described in section IV change this order (h is the number of hidden neurons):

    Rule     Order after application
    Fig. 4   (|e|, |nt|, |?| - i - o, |C| + i, |N|, |D|, |T| + o)
    Fig. 5   (|e|, |nt|, |?| - 1, |C|, |N| + 1, |D|, |T|)
    Fig. 6   (|e|, |nt|, |?| - 1, |C| + 1, |N|, |D|, |T|)
    Fig. 7   (|e|, |nt|, |?|, |C| - 1, |N|, |D| + 1, |T|)
    Fig. 8   (|e|, |nt|, |?|, |C| - 1, |N|, |D| + 1, |T|)
    Fig. 9   (|e|, |nt|, |?|, |C| - 1, |N|, |D| + 1, |T|)
    Fig. 10  (|e|, |nt|, |?|, |C| - 1, |N|, |D| + 1, |T|)
    Fig. 11  (|e|, |nt| - 1, i + o + 2h, |C| - h - i, |N|, |D| - h, |T| - o)
    Fig. 12  (|e| - 1, |nt|, |?|, |C|, |N|, |D|, |T|)
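As a cross-check of this argument (again outside the paper itself), one can apply the changes listed in the table to a sample tuple and verify that every rule application yields a lexicographically smaller tuple. The concrete start values and the helper name deltas below are assumptions of this sketch.

    # Verify that each rule application strictly decreases the termination measure
    # (|e|, |nt|, |?|, |C|, |N|, |D|, |T|) in the lexicographic order.
    # Example net: i = 3 inputs, o = 2 outputs, h = 1 hidden neuron, m = 2 patterns.
    i, o, h, m = 3, 2, 1, 2

    def deltas(t):
        e, nt, q, C, N, D, T = t
        # Figs. 8-10 change the tuple exactly like Fig. 7 and are omitted here.
        return {
            "Fig. 4":  (e, nt, q - i - o, C + i, N, D, T + o),
            "Fig. 5":  (e, nt, q - 1, C, N + 1, D, T),
            "Fig. 6":  (e, nt, q - 1, C + 1, N, D, T),
            "Fig. 7":  (e, nt, q, C - 1, N, D + 1, T),
            "Fig. 11": (e, nt - 1, i + o + 2 * h, C - h - i, N, D - h, T - o),
            "Fig. 12": (e - 1, nt, q, C, N, D, T),
        }

    start = (2 * m, m, i + o, 0, 0, 0, 0)   # initial value of the order
    sample = (2, 1, 2, 4, 1, 1, 2)          # an arbitrary intermediate tuple
    for t in (start, sample):
        for rule, after in deltas(t).items():
            assert after < t, rule          # Python compares tuples lexicographically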
Since the number of epochs has an upper bound and each rule application decreases the order, the order must eventually reach its lower bound. Therefore the entire training terminates and is correct as stated in theorem V.7.

VI. Conclusions

In this paper a framework to formalize Neural Networks together with their corresponding training algorithms using Graph Rewriting Systems was used to prove properties of an existing training algorithm. It was demonstrated how Graph Rewriting Systems offer a graphical and intuitive way to specify training algorithms for Neural Networks and how the underlying theory of Graph Rewriting Systems can be used to prove properties of the training algorithm.

References
[1] Michael R. Berthold and Jay Diamond: Constructive Training of Probabilistic Neural Networks, Neurocomputing, Elsevier, to appear.
[2] Michael R. Berthold and Ingrid Fischer: Modelling Neural Networks with Graph Transformation Systems, Proceedings of the IEEE International Joint Conference on Neural Networks, 1, pp. 275-280, 1997.
[3] G. Rozenberg (ed.): Handbook of Graph Grammars: Foundations, 1, World Scientific, 1997.
[4] Donald F. Specht: Probabilistic Neural Networks, Neural Networks, 3, pp. 109-118, 1990.