IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 17, NO. 1, JANUARY 2006
Generalized Haar DWT and Transformations Between Decision Trees and Neural Networks
Rory Mulvaney and Dhananjay S. Phatak
Abstract—The core contribution of this paper is a three-fold improvement of the Haar discrete wavelet transform (DWT). It is modified to efficiently transform a multiclass- (rather than numerical-) valued function over a multidimensional (rather than low-dimensional) domain, or to transform a multiclass-valued decision tree into another useful representation. We prove that this multidimensional, multiclass DWT uses dynamic programming to minimize (within its framework) the number of nontrivial wavelet coefficients needed to summarize a training set or decision tree. It is a spatially localized algorithm that takes linear time in the number of training samples, after a sort. Convergence of the DWT to benchmark training sets seems to degrade with rising dimension in this test of high-dimensional wavelets, which have been seen as difficult to implement. This multiclass multidimensional DWT has tightly coupled applications ranging from learning "dyadic" decision trees directly from training data, to rebalancing or converting preexisting decision trees to fixed-depth boolean or threshold neural networks (in effect parallelizing the evaluation of the trees), to learning rule/exception sets represented as a new form of tree called an "E-tree," which could greatly help interpretation/visualization of a dataset.
Index Terms—Decision tree rebalancing, fault tolerance, Haar wavelet, multiclass, rule generation, threshold network.
I. INTRODUCTION

HUMAN INTERPRETATION of pattern recognition classifier functions becomes difficult using standard visualization with more than two or three variables, leading to a need for alternate methods to explore and understand this multivariate data. An intuitive solution to this problem would be to explore the data as a kind of multidimensional map marked with application-dependent terms, where preferably the level of detail could be reduced in a prioritized way. A good way to accomplish this is with a tree-structured set of rules and exceptions. Complicating the extraction of rules from a machine-learned class-valued function, somewhat unexpectedly, is the use of multiclass-valued functions, which are defined as functions taking one of three or more nonnumeric class values at a point. This problem stems from the fact that the usual numerical trick of approximating two values by their average does not apply to class values. For example, even if the (class-valued) vowel sounds "a," "e," and "i" are given numeric labels 1, 2, and 3, it is nonsense to expect the "average" of "a" and "i" to be "e." Therefore, some of the most powerful pattern recognition learning tools represent a k-class function as a set of k two-class functions, even though this makes their interpretation as a whole (by a single ruleset, for example) more difficult. This paper shows how to find a compact hierarchical set of rules and exceptions (see Sections III-A and VI-A) that describe a multiclass function, by introducing some clean generalizations of the Haar discrete wavelet transform (DWT) algorithm that allow its application directly to a multiclass decision tree. The resulting hierarchical "E-tree" ruleset would be a navigable multidimensional map, with statistics for each rule that, for example, a doctor would find very helpful for making inferences about a disease. An additional benefit of a ruleset is that the rules can also be evaluated very quickly in parallel by a three-layer threshold network representation of the rules. Real-time pattern recognition is vital for quickly identifying security or quality control problems at a port or factory, and in the well-documented case of handwritten zip code recognition, where nearest neighbor methods are too slow [1].
Manuscript received September 3, 2004; revised April 5, 2005. This work was supported in part by the NSF under Grants ECS-9875705 and ECS-0196362. The authors are with the Department of Computer Science and Electrical Engineering, University of Maryland Baltimore County (UMBC), Baltimore, MD 21250 USA (e-mail:
[email protected]). Digital Object Identifier 10.1109/TNN.2005.860830
A. Generalized Application of Haar DWT

The Haar DWT is generalized to work on three forms of data, yielding several new advantageous uses described in the paragraphs below. The multiclass, multidimensional Haar DWT can be applied to data in the form of: 1) a "dense" vector of values over the full domain, as in Fig. 4; 2) a sparse set of (point, value) pairs over the domain, as in the experiments in Section V-A and prior work in [2]; or 3) a decision tree, as in Section VI. In all forms of the DWT, the resulting coefficients are a hierarchical set of "convex-region" rules and exceptions. The rules output by the multiclass Haar DWT have several uses, most obviously as a human-interpretable hierarchical rule/exception set in the form of an E-tree, as discussed in Section VI-A. The generation of these rules appears more elegant than the generation methods used by the C4.5 program [3], which does not seem to form exceptions. A second use of rulesets is their parallel evaluation in the form of a threshold or Boolean network, in effect speeding the evaluation of the decision tree (see Section III). In [4], a method to construct similar neural networks from decision trees was sketched (not very explicitly). There and in [5], generalization error was further improved by using sigmoidal threshold functions and gradient descent. For simplicity, we ignore potential gains from optimization along gradients, but give a more explicit and efficient network structure generation algorithm. Third, the ruleset representation is shown to help rebalance existing decision trees in Section VI-B, by recursively balancing the number of rules to be evaluated on each side of a tree. As
noted in a Ph.D. thesis on decision trees by Murthy [6], there is very little published on rebalancing decision trees. Murthy borrowed the concept of “rotations” for balancing binary search trees [7], which are used to speed up the search for a given key. Using only simple parent-child local rotations on actual decision trees, Murthy was able to get nice results with depth reduction. Section VI-B illustrates how rulesets enable a more sophisticated rebalancing analysis. Last, when the DWT is applied directly to a sparse set of point-value pairs, the resulting rules can be easily converted to a “forced-split” dyadic decision tree. Dyadic decision trees were introduced recently by Scott and Nowak [8], as trees whose decision planes bisect their domain parallel to a coordinate axis. The decision planes down a path in a forced-split tree cycle through the coordinates, while free-split trees allow the coordinate to be chosen independently at each node. To augment the work in [2], a proof is given in Section IV that the Haar transform minimizes the number of nontrivial coefficients (rules) within the forced-split framework.
B. Relation to Prior Work on Spectral Analysis of Classification Functions

The simple binary recursive nature of the Haar DWT allows it to remain applicable (along with the Walsh-Hadamard transform [9]) all the way down to boolean function representation minimization. Falkowski and Chang found the boolean analog to the Haar DWT-decision tree relationship by finding forward and inverse transforms between Haar spectra and Ordered Boolean Decision Diagrams [10]. They later used the Paired Haar Transform [11], [12] to approximate an incompletely specified Boolean function and also to assist in the top-down minimization of its Free Boolean Decision Diagram (FBDD) representation in [13], by deciding the optimal variable to consider at each decision node, in much the same way as a top-down decision tree learner works. Since an FBDD is basically the same as a free-split dyadic decision tree for boolean functions (with additional features to share subtrees), decision tree learning is closely related to boolean function representation. It appears that these greedy top-down methods can be further improved by another application of the Haar DWT to the learned tree or FBDD for rebalancing analysis, as sketched in Section VI-B. Gunther and Drechsler [14] give an exact algorithm, along with heuristics for use with many variables, to find the minimum FBDD representation of a completely specified Boolean function. Haar spectral diagrams, Walsh-Hadamard decision diagrams, and other discrete spectral transforms are discussed at length in [15], [16], mostly in the context of boolean functions. Though Walsh-Hadamard and Fourier representations probably offer somewhat more compression ability, it seems Haar transforms are more naturally suited to tree-structured evaluation and representation than Walsh-Hadamard transforms, since evaluation of a function at a given point from its Haar coefficients only requires operations with a logarithmic number of coefficients, using the binary tree structure of the coefficients, whereas evaluation from Walsh-Hadamard coefficients requires access to all coefficients. Dyadic decision trees, which essentially represent the Haar coefficients of a multidimensional real domain, have recently been studied by Scott and Nowak [8], [17], and shown to offer near-optimal asymptotic convergence on worst-case functions. Little has been published on the use of spectral analysis of general decision tree structures on more general domains such as the multidimensional real domain, as opposed to the boolean domain with a canonical ordering of boolean function values. Walsh-Hadamard transforms of decision trees are used by Park and Kargupta [18], [19] for combining spectra of an ensemble of decision trees, possibly learned at multiple sites, as well as for compression. The multiclass Haar coefficients of decision trees, as discussed in Section VI, have an intuitive interpretation as regions of the domain, offering several new capabilities such as rebalancing, multiclass (nonnumerical) value representation, and storage of histograms.
C. Three-Layer Networks Allow Direct Encoding of Rules and Exceptions

1) Two-Layer Networks Awkward: The universal approximation power of single-hidden-layer multilayer perceptrons, which are referred to here as two-layer networks, was proven (in one version) by Hecht-Nielsen in 1989 [20] using the simple observation that any one-dimensional (1-D) function (therefore the sine and cosine functions) may be approximated to arbitrarily small error with a single hidden layer of nodes (using sigmoid or threshold transfer functions), and then the multidimensional Fourier transform can be applied to extend this capability to multidimensional functions. Of course, in practice, constructing a neural network for classification functions this way would require many hidden nodes, because many modes of the multidimensional Fourier series may be required, and each of these modes (a sine or cosine) requires many hidden nodes (with threshold or sigmoid transfer functions) to approximate accurately.

2) Three-Layer Networks: Intuitive and Efficient: Because of the just-mentioned large constant-multiple overhead cost of using Fourier modes, and the fact that Fourier spectra lose much of their approximating power with nonnumerical multiclass functions (if classes A, B, and C are represented by 1, 2, and 3, there is no reason that the "average" of A and C should be B), we turned instead to a multidimensional Haar discrete wavelet transform (DWT) [21], since it is piecewise constant and therefore much easier to represent class values using only a small number of simple threshold nodes per wavelet coefficient. (We make several modifications to the Haar DWT, including one that allows it to work with class-valued functions, rather than only numerical-valued functions.) Three-layer networks are also intuitive since each second-layer node corresponds directly to a convex region of the domain, allowing additional information such as histograms to be stored for each region. An additional layer, however, is required for this since Haar coefficients are, unlike Fourier coefficients, multidimensional. So the Haar approximation of the function will actually be represented by a two-hidden-layer threshold network (which is referred to here as a three-layer network), though sigmoidal units could be used to facilitate postprocessing optimization along the gradient defined by an objective function.
Fig. 1. Haar DWT example. The elements in the original vector on the top row are paired off, with their averages and halved-differences placed in the next row in the left and right arrays, respectively. Each row is also multiplied by the constant factor on the left. Only the high-pass coefficients (in the right array of each row) and the final low-pass coefficient in the bottom row need to be retained to reconstruct the original vector.
It is shown how (in Section III-B) to directly translate the multiclass Haar DWT coefficients into the nodes of a three-layer threshold neural net, or even into three compact levels of boolean logic, ignoring inverters (in Section III-D), after the first thresholding layer converts inputs to boolean values. The relative ease and intuition of using three levels of boolean logic is somewhat supported by theory [22], [23] that says the jump from 2 to 3 levels of boolean logic can dramatically reduce the number of gates required, while little further improvement is possible (in the worst case) by further increasing the number of layers. More specifically, Sasao [23] used combinatorial methods to show (approximately stated) that the log of the number of gates required to realize the worst possible n-variable boolean function in three levels of unlimited fan-in logic (using AND, OR, and NOT gates) is at most about half of that required to realize "almost any" n-variable Boolean function in two levels, and about the same as that required for almost any function in four or more levels. However, see [24] for some specific concrete examples of four-layer networks that require fewer nodes than their three-layer equivalents. Intuitively, a big reason for the advantage of three-level networks seems to stem from the fact that they allow exceptions (and exceptions to the exceptions) to implicants, or convex regions of space, leading to a factor of O(d) savings (see Section III-A and Fig. 5) in the number of convex polytopic regions or conjunctive implicants needed to represent certain functions.

D. Problem Definition and Convergence

Because it is well known that decision trees (DTs) are relatively difficult to regularize and guard against overfitting (see [25]), we select a problem definition for which the optimal solution is a decision tree, and find that regularization becomes easier for this alternate problem. This problem definition, playing purely to the strength of DTs, is to find a regularized, fast-evaluating approximation, with manageable storage complexity, of a known (slow-evaluating) class-valued function. Regularization for class-valued functions, approximately stated, is a measure of the maximum distance the approximated decision boundary strays from the true decision boundary, so that regularization methods focus on uniform convergence to the correct decision surface. Applying this to decision trees, Section V-A1 briefly suggests that to regularize to a known function, a constructive postprocessing method with adaptive
sampling might be used rather than pruning to more precisely guide this uniform convergence. The convergence must also be efficient to avoid an explosion in storage complexity, so that doubling the number of decision planes improves the convergence by as large a factor as possible. Actual methods for efficient adaptive sampling for constructive regularization are a topic outside the scope of this work; instead, we simply use multidimensional pattern recognition benchmark training sets to train a classifier using the multiclass Haar DWT, and measure convergence via the generalization error on a test set without any pruning or regularization. This is useful, since the implementation of multidimensional DWTs has been seen as difficult [26], so our tests are likely (to our knowledge) among the first empirical tests (see also [8]) of the convergence properties of higher-dimensional wavelets.

II. MULTICLASS MODIFIED HAAR DWT

A. 1-D Haar DWT

The (orthonormal) Haar DWT of order $N$ (a power of 2) is defined, as in [27], by the action of the matrix $H_N$ on a vector $x$, so that $w = H_N x$, where $H_N$ is recursively defined by the tensor products in (1), with $H_1 = [1]$, and $I_{N/2}$ is the identity matrix of order $N/2$:

$$H_N = \frac{1}{\sqrt{2}} \begin{bmatrix} H_{N/2} \otimes [\,1 \;\; 1\,] \\ I_{N/2} \otimes [\,1 \; -1\,] \end{bmatrix}. \tag{1}$$

Though easily defined by the above matrix-vector multiplication, the structure of $H_N$ allows the transform to be performed in $O(N)$ time using a simple algorithm described next. The Haar DWT can be easily understood by referring to Fig. 1. The top row is the data being transformed, and the resulting wavelet coefficients are in the bottom rows of each column. The coefficient in the far lower-left box is not a "wavelet" coefficient, but rather the residual "scaling" coefficient. Starting at the top row, elements are paired off (indicated by the circles). The four average values (multiplied by a constant factor) of these four pairs are placed in the second row in the left box. The "halved differences" of these four pairs are placed in the right box in the second row. These are the finest-scale wavelet coefficients. The values in the left box represent an approximation of the original data (from a "low-pass" filter), while values in the right box are the "details" from a "high (frequency) pass" filter. It is easy to see how to reconstruct the original data from these means and differences. This procedure is performed recursively on the left half (the low-pass coefficients) until the data has been reduced to the desired number of low-pass coefficients, in this case, the minimum, 1. For more background on wavelets, the reader may refer to [21].
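To make the walkthrough of Fig. 1 concrete, here is a minimal numeric sketch of this 1-D transform (our own illustration, not the paper's code; it uses plain means and halved differences, omitting the orthonormal constant factors shown at the left of the figure, and the function names are ours):

import numpy as np

def haar_dwt_1d(x):
    """1-D Haar DWT of a length-2^k vector.  Returns the halved-difference
    (high-pass) coefficients per level, plus the final scaling coefficient."""
    x = np.asarray(x, dtype=float)
    details = []
    while len(x) > 1:
        pairs = x.reshape(-1, 2)
        lo = pairs.mean(axis=1)                    # "averages" (low-pass)
        hi = (pairs[:, 0] - pairs[:, 1]) / 2.0     # "halved differences" (high-pass)
        details.append(hi)
        x = lo                                     # recurse on the low-pass half
    return details, x[0]

def haar_inverse_1d(details, scaling):
    """Reconstruct the original vector: a = lo + hi, b = lo - hi at each level."""
    x = np.array([scaling])
    for hi in reversed(details):
        pairs = np.empty((len(hi), 2))
        pairs[:, 0] = x + hi
        pairs[:, 1] = x - hi
        x = pairs.reshape(-1)
    return x

For example, haar_dwt_1d([5, 3, 2, 2, 4, 0, 1, 7]) yields per-level high-pass coefficients plus the overall mean 3.0 as the scaling coefficient, and haar_inverse_1d recovers the original vector exactly.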
Fig. 3. Serialization of data in a 2-D matrix.
Fig. 2. Multidimensional grouping progression.
B. Scalable Haar DWT for Arbitrarily High Dimensionalities

There are actually many different ways of performing a multidimensional wavelet transform. In a primer on wavelets [21], two methods for performing multidimensional Haar DWTs are discussed. Regarding the actual shape and form of the multidimensional wavelets, we basically use the second method, which they refer to as the "nonstandard method." The operation of this transform is simple. Fig. 2 shows the two-dimensional analog of low-pass coefficient pairings (the circled pairs) used to compute the "means and differences" in four consecutive levels of the algorithm. At each level, the algorithm simply alternates between pairing neighbors in the horizontal direction and pairing in the vertical direction. In the progression of figures in Fig. 2, each cell is labeled with a number indicating the serialized index of each two-dimensional (2-D) data point. Notice that this is not the usual row-major ordering of data that is typically used in a d-dimensional matrix. Also notice that every pairing joins cells with consecutive indices (as it was in the 1-D transform in Fig. 1). Thus, this 2-D transform can be more easily performed as a 1-D transform on the data sorted in this manner. It turns out that this ordering of the data generalizes to any dimensionality, so that any d-dimensional Haar DWT may be performed as a 1-D Haar DWT. The ordering is also known as a Peano curve or z-ordering (a "z" shape is recursively traced in two dimensions as in Fig. 3) [28], [29]. This ordering can be simply described as the ordering induced by interleaving the bits of a point's d-dimensional coordinates into a single coordinate. For example, if $x_{b-1}$ represents the most significant bit of the x-coordinate and $x_0$ represents the least significant bit, then $x_{b-1} y_{b-1} z_{b-1} \cdots x_0 y_0 z_0$ is the merged, interleaved index for a three-dimensional point $(x, y, z)$. See [30] for the Matlab comparison function used in sorting the data in this localized fashion. To visualize this ordering of the data in two dimensions, refer to Fig. 3. The numbers in the figure represent the serialized indices of the elements in the corresponding blocks of the matrix. It has a natural block-structured recursive hierarchy that corresponds nicely with the bit pattern of the serialized index: the bits of the serialized index tell which path to take through a binary tree whose leaves are the data elements of the matrix. In the 2-D example of the figure, the most significant bit of the index tells whether the data lies in the upper or lower half of the matrix, the second bit tells whether it lies in the right or left half of that half, and so on. The z-ordering has the valuable property of being a "localized" ordering; that is, points that are near each other in the multidimensional space also tend to be near one another in the serialization. In general then, the multidimensional Haar DWT becomes piecewise constant over "dyadic hyperrectangles," which can be specified by a d-dimensional integer address and the number of significant bits. In mathematically rigorous terminology as from [8], the wavelet transform learns a function that is constant on each of the parts of a cyclic dyadic partition (CDP) of the real domain, where the partition is tree-structured and depth-first ordered. In this partition, each part $A_i$ is a hyperrectangle with sides parallel to coordinate axes. If $A_i$ is a cell at depth $j$ in the tree, let $A_i^{(1)}$ and $A_i^{(2)}$ be the hyperrectangles formed by splitting $A_i$ along its midpoint on coordinate $j \bmod d$. Then a CDP is defined by the following recursive rule: if $\{A_1, \ldots, A_k\}$ is a CDP, so is $\{A_1, \ldots, A_{i-1}, A_i^{(1)}, A_i^{(2)}, A_{i+1}, \ldots, A_k\}$. In practice, for our implementation, the real domain is mapped to a d-dimensional integer lattice.
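A minimal sketch (our own, not the paper's code; the paper's Matlab comparison function is at [30]) of the bit interleaving that induces this z-ordering, so that a d-dimensional transform can be run as a 1-D transform over the sorted points:

def z_order_key(coords, bits):
    """Interleave the bits of a point's d integer coordinates (most
    significant bits first) into a single serialized index."""
    key = 0
    for b in range(bits - 1, -1, -1):    # from MSB down to LSB
        for c in coords:                 # fixed coordinate order for all points
            key = (key << 1) | ((c >> b) & 1)
    return key

# Sorting sparse training points by this key places neighbors in the dyadic
# hierarchy next to each other in the serialization.
points = [(3, 1), (0, 2), (2, 2), (1, 0)]
points.sort(key=lambda p: z_order_key(p, bits=2))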
Fig. 4. Multiclass Haar DWT example. The first subfigure illustrates the first phase of the algorithm, where candidate sets are established for the low-pass coefficients (the left half of the arrays on each level). The second subfigure shows the second phase, where the candidate sets are narrowed to a single class and the high-pass coefficients (the right half of the arrays, on each level) are set accordingly. Only the high-pass coefficients and the final low-pass coefficient at the bottom are needed to represent the original values of the function, given in the top row. Dashes (-) represent unknown regions of the domain which have not been measured. See Section II-D for an explanation in words, or [30] for pseudocode.
C. Adaptation for Sparse (Point, Value) Input

For multidimensional functions, it quickly becomes impractical to have a training set with every point on the d-dimensional integer lattice. The Haar DWT is easily adapted to efficiently handle regions or points of the space having no information. In [11], [12], this is done using the Paired Haar Transform for Boolean functions. For multiclass- and numerical-valued functions, we modify it to efficiently glob these points or regions together until each such region globs together with a point/region that does have a known value, and implicitly assigns the same value to all points in the entire region. (The exact mechanism will be shown by the handling of "don't-care" values in Section II-D.) This introduces very little overhead to the algorithm, allowing its time complexity to be almost independent of the dimensionality (other than storing and reading the extra address bits) and basically linear (after a sort) in the size of the training set, i.e., the points with known values. Since this process of assigning values to regions with unknown values is also where generalizations are made, care might be taken to sample a few points from these unknown regions and approximate their values using nearest neighbor (or other) techniques, rather than simply hoping the locality of neighboring points in the z-ordering will always make good generalizations (see also Section V-A1).
D. Modifications for Multiclass (Non-Numerical) Values Now that the multidimensional Haar DWT has been completely specified, the modifications to handle multiple class values are described here. We’ve made a more formal pseudocode description available on the web at [30]. Fig. 4 is intended as an instructive, perhaps unrealistic, example of the algorithm. This is similar to the previous example from Fig. 1, except that we now have class values represented by letters (standing for the colors black, cyan, green, purple, red, and yellow) instead of real numbers, and there are also unknown regions of the function, represented by dashes. As before, the low-pass coefficients express an approximation of the function, while the high-pass values express the differences needed to reconstruct the function from the approximation, so all of the low-pass coefficients except the final coarsest
one can be discarded. Since the DWT minimizes the number of nontrivial high-pass coefficients, the final low-pass coefficient is not necessarily the most frequent value of the function; if it were selected this way, more nontrivial high-pass coefficients might be required. See the section on optimality (Section IV) to better understand the intuition behind the multiclass algorithm. The dashes represent unknown regions and are treated as "don't-cares," so they effectively take the same value as the neighboring region. Since high-dimensional training sets are likely to be very sparse, the unknown points are not explicitly stored as don't-cares; rather, the multidimensional integer coordinates are stored with point values, and neighboring point coordinates in the serialized ordering are checked to see if they are actually neighbors in the dyadic domain at a given resolution, before taking intersections or unions of candidate sets. (As a property of the z-ordering, a point in the sparse serialization is always guaranteed to be paired in the dyadic hierarchy with one of its two neighbors: the one with which it shares the longer string of most significant bits in its interleaved multidimensional integer index.) For instructive purposes we explain the algorithm in two phases. The first phase, illustrated in Fig. 4(a), involves establishing "candidate sets" of low-pass coefficients. Instead of taking the mean of two low-pass coefficients to produce the low-pass coefficient at the next level, we take the intersection set of the two candidate sets, if there is an intersection. If there is no intersection, we take the union set. Whenever there is an intersection between the candidate sets, we also set the corresponding high-pass coefficient (recall this was the "halved difference" in the standard Haar DWT) to zero. A high-pass coefficient corresponding to a candidate set formed as the union of two candidate sets is determined later, and is represented as a '?' in the first-phase figure. In the second phase, completed in Fig. 4(b), all candidate sets are narrowed to a single class value and all undetermined high-pass coefficients are also determined. All low-pass coefficients, starting at the coarsest (lowest) level, are resolved, moving to finer (higher) levels. If the high-pass coefficient parallel (having the same index at the same level) to the low-pass coefficient is zero, the two parents of the low-pass coefficient (at the next level up) are set equal to the child; otherwise, the high-pass coefficient is assigned the most-likely value from the candidate set of the parent whose candidate set does not include the child's value, and the parallel high-pass coefficient of the child is set to indicate which parent the high-pass coefficient applies to (in Fig. 4, "L" and "R" subscripts denote the left or right parent, respectively), along with the differing parent's value. Note that, as indicated in Fig. 4(b), since red is more likely than yellow in the original indices 9-12, red is preferred over yellow in assigning the third low-pass coefficient in the third-coarsest level. Only the high-pass coefficients and the final low-pass coefficient are necessary to reconstruct the function; all other low-pass coefficients may be discarded. Note that since the input can be given as a sparse set of (point, value) pairs, neither the time nor the space complexity of the algorithm is affected by the number of don't-cares, since they don't actually need to be stored or ever even acknowledged by the algorithm.
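The following is a rough Python sketch (our own restatement, not the paper's pseudocode at [30]) of the two phases just described, applied to a dense 1-D vector of class labels; it records, per level, the cells whose resolved class differs from the coarser cell covering them, which correspond to the nontrivial high-pass "exception" coefficients (the paper's L/R bookkeeping and the sparse, z-ordered input handling are omitted):

from collections import Counter

def multiclass_haar(values):
    """Two-phase multiclass Haar DWT sketch on a dense length-2^k vector of
    class labels (None = unknown / don't-care).  Returns the coarsest resolved
    class and the nontrivial exceptions as (level, index, class) triples."""
    n = len(values)
    # Phase 1: bottom-up candidate sets (None = unconstrained don't-care).
    levels = [[{v} if v is not None else None for v in values]]
    while len(levels[-1]) > 1:
        prev, cur = levels[-1], []
        for a, b in zip(prev[0::2], prev[1::2]):
            if a is None or b is None:
                cur.append(a if b is None else b)   # don't-care inherits neighbor
            else:
                inter = a & b
                cur.append(inter if inter else (a | b))
        levels.append(cur)

    def most_frequent(cands, lo, hi):
        # Tie-break as the text suggests: most frequent class within the support.
        counts = Counter(v for v in values[lo:hi] if v in cands)
        return counts.most_common(1)[0][0] if counts else next(iter(cands))

    # Phase 2: top-down resolution of candidate sets into single classes.
    top = levels[-1][0]
    root = most_frequent(top, 0, n) if top is not None else None
    resolved, exceptions = [root], []
    for lvl in range(len(levels) - 2, -1, -1):
        span = n // len(levels[lvl])                # points covered by each cell
        finer = []
        for i, coarse in enumerate(resolved):
            for j in (2 * i, 2 * i + 1):
                cand = levels[lvl][j]
                if cand is None or coarse in cand:
                    finer.append(coarse)            # trivial coefficient: inherit
                else:                               # nontrivial exception sticker
                    cls = most_frequent(cand, j * span, (j + 1) * span)
                    finer.append(cls)
                    exceptions.append((lvl, j, cls))
        resolved = finer
    return root, exceptions

On a small input such as ['b', 'b', 'r', None, 'r', 'r', 'y', 'y'] this returns the coarse class 'r' plus two exceptions, and the don't-care position implicitly takes the value of its neighboring region, as described above.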
Fig. 5. Exceptions can yield a factor of O(d) savings in stickers. The shaded areas are class 1, while the white areas are class 0. The first figure has white areas surrounded with stickers, while the second simply shows white stickers B and C on top of sticker A and shaded sticker D over sticker C, thus using half the number of stickers as the first figure.
When the algorithm pairs off neighboring points in the serialized array, it simply needs to check their coordinates to verify that the points really are neighbors at the given resolution; otherwise, there is an implicit unknown point between them. Thus, given n original data points, only O(n) coefficients are computed, so after the data has been sorted, the run time is linear in n. More generally, the running time also grows with the number of classes (through the candidate-set operations) and with the total number of bits required for the coordinates of each point. One should also note that all memory accesses are highly localized, likely in stark contrast to other multidimensional algorithms. Thus this algorithm should encounter a minimal number of memory cycle delays, making it extremely fast.

III. FAST PARALLEL EVALUATION

A. Exceptions and the Sticker Analogy

A useful intuitive image is that the nonzero high-pass coefficients represent colored hyper-rectangular stickers which are placed on top of one another (in overruling fashion) in regions within the dyadic framework (such as those regions in Fig. 2). Since stickers are allowed to be placed on top of one another, the overruling stickers are like exceptions to rules, where an exception is defined to cover part of only the single sticker immediately below it; exceptions in the Haar framework don't straddle multiple rules. As Fig. 5 illustrates, allowing these exceptions can save a factor of about O(d) stickers (where d represents the dimensionality), since not allowing overruling stickers entails storing all the convex regions that encase the exception region, rather than simply storing the loose generalization and the exception. Intuitively, this capability to store exceptions is one of the major advantages of three-layer threshold and boolean networks over their two-layer counterparts, which must use the indirect encoding of exceptions by explicitly storing all convex regions that encase a convex exception.
Fig. 6. Architecture of the threshold network.

B. Conversion to Threshold Neural Network

This sticker analogy eases understanding of how the threshold network is constructed, as the network essentially implements each sticker and combines them in the output layer. Fig. 6 illustrates the architecture of the threshold network. Each node in the first hidden layer implements a hyperplane parallel to a coordinate axis, so each requires only one of the inputs and a bias. During construction of the network, a hashtable is used to ensure identical first-layer nodes are not created. Each node in the second hidden layer corresponds to a nonzero high-pass coefficient, or hyper-rectangular sticker. Such a node is activated iff the input to the network is inside the region of the corresponding sticker; thus it is activated iff the input is inside all hyperplanes that delimit the hyper-rectangular region. Output nodes output the weighted sum of their inputs and a bias, so each class needs to have a numerical assignment. To understand how to set the weights to the output units, notice that a general point is inside the sticker-regions corresponding to some "signature" set of sticker-regions, or wavelet coefficients. This set of regions is always nested: each region completely contains the next. Also note that all points with the same signature set are classified identically by the wavelet-transformed function. Let $w_R$ represent the weight of the output connection from the node representing region $R$, and let $w_0$ represent the bias due to the final low-pass coefficient. Then, given an input point whose signature set is $R_1 \supset R_2 \supset \cdots \supset R_m$, the output of the network is clearly $w_0 + \sum_{i=1}^{m} w_{R_i}$. Regardless of the value of the partial sum $w_0 + \sum_{i=1}^{m-1} w_{R_i}$, we can always set $w_{R_m}$ so that the network outputs the correct value for that point (and all points with the same signature set). These weights can be computed efficiently in a depth-first traversal of the tree structure of the wavelet coefficients. To use simpler Boolean output units, it is straightforward to change the output layer and connections to have $\lceil \log_2 k \rceil$ boolean outputs for k classes. All weights to the output layer would then be 1 or -1. However, the amount of fan-in to the output nodes now becomes slightly sensitive to the assignment of classes to integers. Since it is a dense coding of classes, it would seem difficult to reduce fan-in very much without trial and error, but perhaps assigning the classes with the most nonzero coefficients to the integers with the fewest bits set would help. Though intended as a hardware implementation, the threshold network could also be simulated using sparse matrix multiplication (as was done in our experiments).
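A sketch of the depth-first weight assignment mentioned above, using one consistent scheme of our own devising (not taken verbatim from the paper): the bias is the class code of the final low-pass coefficient, and each region's weight is the difference between its class code and that of its enclosing region, so the sum over any nested signature set telescopes to the innermost region's class code.

def assign_output_weights(root_class, top_regions, children, region_class):
    """Depth-first output-weight assignment.  `children[r]` lists the
    exception regions directly inside region r; `region_class[r]` is the
    numeric class code of region r.  Returns (bias, per-region weights)."""
    weights = {}
    def visit(region, enclosing_code):
        code = region_class[region]
        weights[region] = code - enclosing_code      # telescoping difference
        for child in children.get(region, []):       # direct exceptions of region
            visit(child, code)
    for top in top_regions:
        visit(top, root_class)
    return root_class, weights

# Hypothetical example: root class 0, region A (class 1) with exception B (class 0).
bias, w = assign_output_weights(0, ['A'], {'A': ['B']}, {'A': 1, 'B': 0})
# Point inside A only: bias + w['A'] = 1.  Inside A and B: bias + w['A'] + w['B'] = 0.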
Fig. 7. Boolean network created from the function in Fig. 5(b).

C. Fault Tolerance

The original motivation of this work was actually to construct a fault-tolerant multilayer perceptron (MLP). Since all weights to the boolean output layer are 1 or -1, making a class-valued (not continuous real-valued) network completely fault tolerant to single faults is as simple as making two replications (for triple modular redundancy) of each subnet that feeds the output layer, as described in [31]. There it was proven that the minimum number of replications required for fault tolerance is 2. Previous work [32], [33] was unable to achieve this lower bound. See [34] for recent work on fault tolerance for continuous real-valued MLPs.

D. Conversion to Boolean Network

Finally, we describe the simple conversion of everything after the first hidden layer of the threshold network (the inputs are not necessarily boolean) to an AND-AND-OR Boolean logic circuit of about the same size for a two-class function, or to $\lceil \log_2 k \rceil$ such Boolean functions for a k-class function. Fig. 7 shows the network created from the function shown in Fig. 5(b). Notice that in order to allow "exceptions," three layers of logic (ignoring intermediate NOT gates) are needed. The first layer consists of an AND gate for each nonzero wavelet coefficient. Each outputs 1 whenever the input is inside the hyper-rectangular sticker-region for its corresponding coefficient. Thinking in 2-D, there is a colored sticker at the bottom that shows through in many places. Those stickers directly on top of the bottom sticker are its "exceptions." We use a second-level AND gate to represent this by ANDing the bottom sticker with the NOT of each of the stickers directly on top of it. Now
recursively treat the "exceptions to the exceptions" as if they were top-level rules, making first-level AND gates for each of them. All second-level AND gates, as well as the first-level AND gates of rules that have no exceptions, are input to the output OR gate.
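A sketch of this recursive AND-AND-OR construction for a two-class function (the data-structure names are ours; the region labels follow the sticker example of Fig. 5(b), with shaded sticker A, white exceptions B and C on top of A, and shaded sticker D on top of C):

def build_and_and_or(top_regions, children, region_class):
    """Each returned entry (region, negated) describes one input to the output
    OR gate: the first-level AND for `region`, conjoined with the NOT of the
    first-level AND of each region in `negated` (its direct exceptions).
    A class-1 region with no exceptions has an empty `negated` list."""
    or_inputs = []
    def visit(region):
        direct = children.get(region, [])      # exceptions sitting directly on top
        if region_class[region] == 1:
            or_inputs.append((region, list(direct)))
        for ex in direct:                      # exceptions become top-level rules
            visit(ex)
        # class-0 regions contribute no OR input themselves; they appear only
        # negated inside the second-level AND of the region beneath them
    for top in top_regions:
        visit(top)
    return or_inputs

print(build_and_and_or(['A'], {'A': ['B', 'C'], 'C': ['D']},
                       {'A': 1, 'B': 0, 'C': 0, 'D': 1}))
# -> [('A', ['B', 'C']), ('D', [])], i.e., OR( AND(A, NOT B, NOT C), AND(D) )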
IV. OPTIMALITY OF THE MULTICLASS TRANSFORM

Now that the workings of the multiclass algorithm have been presented, we will see that it effectively uses dynamic programming to make optimal decisions about which candidate class to assign as a low-pass coefficient to a region. Each nonzero wavelet coefficient can be thought of as representing one multidimensional colored sticker (or rule/exception), as from Section III-A, where in this algorithm each sticker is restricted to cover only a dyadic hyperrectangle (based on the same ordering of coordinates for all rectangles), and a dyadic hyperrectangle is defined as any possible part of some cyclic dyadic partition, as defined in Section II-B. The definition of optimality, then, is to find the smallest set of nonnull wavelet coefficients, or such stickers, that agrees with the observed function.

It seems easiest to describe things in two dimensions and then generalize to arbitrary dimensions. Suppose we have a transparent piece of glass upon which we place rectangular or square-shaped opaque single-colored stickers, where different colors represent different classes. The (2-D) function is initially represented by placing square-shaped stickers of the smallest size appropriately on an n x n grid, so that the colors show through on the other side of the glass (the sticky sides of the stickers are colored). Now, we begin the transform. We remove some of the smaller stickers already placed, and place double-sized stickers (shaped as the most nearly square concatenation of two stickers from the previous level) appropriately, so that the entire grid is covered by these larger stickers on our side of the glass. Looking through the glass from the other side, the original color pattern defined by the function should still be seen. These larger stickers correspond exactly to the low-pass coefficients that would be calculated by our multiclass transform from Section II-D and Fig. 4, and each sticker is placed over the region of support of its corresponding low-pass coefficient. However, since we want to minimize the number of stickers used, before placing a sticker, a sticker from the previous level is always first removed if it is the same color as the one being placed over it. Note that, when placing a larger sticker, at least one of the two stickers under it is removed (just as a low-pass coefficient must always agree with at least one of the higher-resolution parents it was formed from). Thus, at each level of the transform, the total number of stickers used to represent the function stays the same or goes down. It is also easy to verify that the largest stickers placed are exactly the same as the low-pass coefficients at that level of the transform, and all smaller stickers that are still in place correspond exactly to all the high-pass coefficients. Thus, minimizing the number of stickers required to represent the function (when they are placed within the constraints implied by the above algorithm) is exactly the same as minimizing the number of wavelet coefficients.

Lemma 1: Suppose we have an optimal solution, using s stickers, which is topped off by a red sticker. This solution always implies two optimal subsolutions for the left and right subproblems, obtained by simply removing the top red sticker and placing one or two half-sized red stickers over the two subregions.

Proof: Consider the total number of stickers used by a pair of optimal subsolutions (one for each subregion).

Case 1: The optimal subsolutions use a total of s+1 stickers. Replacing the top red sticker by two half-sized red stickers results in two optimal subsolutions using a total of s+1 stickers.

Case 2: The optimal subsolutions use a total of more than s+1 stickers. This case never arises, since replacing the top red sticker by two half-sized red stickers results in two subsolutions using only s+1 stickers.

Case 3: The optimal subsolutions use fewer than s stickers. This case never arises, since it contradicts our overall solution being optimal.

Case 4: The optimal subsolutions use a total of s stickers.

Subcase 4.1: There exists a pair of optimal subsolutions that are both topped by the same color. This case never arises, since replacing these top stickers of the same color by a double-sized sticker results in an overall solution with s-1 stickers, contradicting that our solution is optimal.

Subcase 4.2: There exists a pair of optimal subsolutions, one of which is topped by the color red, the other by a different color. We take our overall solution, remove the top red sticker, and place a half-sized red sticker over one of the halves to produce such a pair of optimal subsolutions using a total of s stickers.

Subcase 4.3: All pairs of optimal subsolutions are topped by two differing colors, neither being red. This case never occurs. By replacing the large red sticker by two half-sized red stickers, we can produce a left-side subsolution with a stickers and a right-side subsolution with b stickers, such that a + b = s + 1. Now, suppose we examine an actual pair of optimal subsolutions, where the left-side subsolution uses a' stickers and the right-side subsolution uses b' stickers, such that a' + b' = s. Due to the stated optimalities, a' <= a and b' <= b. Since these variables are all positive integers, we must have either a = a' or b = b', so either the left or right subsolution topped by red is actually an optimal subsolution, contradicting the existence of this subcase.

This proof could have been performed in one dimension, using thin stripe-shaped stickers of various lengths rather than box-shaped stickers. Since the proof applies to the one-dimensional case, problems of arbitrary dimension may be solved by first converting them to a one-dimensional problem via interleaving the bits of their data coordinates into a single coordinate, as discussed in Section II-B. Since we've proven the problem has optimal substructure, dynamic programming can be used. Dynamic programming builds on the knowledge of all possible optimal subsolutions at some level of the transform to generate all possible optimal subsolutions at the next level. Now we can see that this is exactly what the algorithm in Fig. 4 does.

Proposition 1: The candidate set for a low-pass coefficient produced by the algorithm from Section II-D is equal to the set of colors that top optimal solutions for the region of support for that low-pass coefficient.
TABLE I RESULTS ON BENCHMARKS. (SEE THE SECOND PARAGRAPH OF SECTION V-A FOR AN EXPLANATION OF TABLE HEADINGS)
Proof: From the Proof of Lemma 1, every optimal solution (for a given region) can be constructed by placing a sticker over the two optimal subsolutions of the region’s two subregions (after removing one or two stickers from the top). The proof also illustrates that each optimal solution is constructed in one of two ways: the first way is given in case 1, and the second way in case 4.2. In case 1, the total number of stickers is reduced by one in the merged solution, while in case 4, the number of stickers required remains the same. We need to show that the algorithm generates a candidate set for the merged region that contains the top color of every optimal solution, and no other colors. Suppose we have case 1, where the merged optimal solution uses 1 sticker less than the subsolutions combined. Then the merged optimal solution can be topped by any color that can top both of the subsolutions. Clearly, the complete set of colors that accomplishes this is the intersection of the candidate sets of the subsolutions—each of these colors tops an optimal solution, and there are no other colors that top an optimal solution (since using a color not in the intersection will not reduce the number of stickers required by 1). So the algorithm correctly generates the complete optimal candidate set in this case. Since subcase 4.2 is the only possible subcase of case 4, case 4 is only possible when there is no intersection between the candidate sets of the optimal subsolutions. Then it follows from subcase 4.2 that the merged optimal solution can be topped by any of these candidates from either subregion. Thus, the algorithm correctly finds the complete set of colors that top an optimal solution, by taking the union of the candidate sets of the optimal subsolutions. Clearly, the stickers can be placed in such a way that any of these colors could be at the top of an optimal solution. Thus, the algorithm correctly generates the complete optimal candidate set in this case, and the proposition is proven. V. EXPERIMENTS A. Benchmark Results Results from application of the Haar DWT to 6 benchmarks are shown in Table I. All the benchmarks except two-spirals are available from [35]. Code to generate the two-spirals data is available from [36]. These classification benchmarks exclusively use numerical features, and have no missing features or values. We used the suggested size for training and test sets where applicable, and used a 75/25 train/test split for the wdbc benchmark. We used the 50/50 split of the aspect-angle dependent data for sonar. For two-spirals, we used the code to generate 386 patterns and split this into training and test sets of 194 and 192 by taking every other pattern from each spiral. In the table, the “Best” column shows the best generalization error achieved by any training method (by the early 1990s)
from the "summary table" available from [35]. The "Default Error" column refers to the generalization error from simply guessing that all patterns belong to the most prevalent class. The "Nodes" column shows the number of nodes in the first and second hidden layers of the threshold network generated from the DWT (the number of second-layer nodes also equals the number of nonnull wavelet coefficients). A "yes" in the "Renorm" column indicates the individual coordinates were renormalized to have equal standard deviations. One big reason for performance discrepancies between benchmarks is that benchmarks might come preprocessed to different degrees, and to the advantage of different training methods. To keep things simple and ensure our results strictly reflect the performance of the Haar DWT (and not some preprocessing transform or denoising filter), we performed no other preprocessing of significance, applying the DWT directly to the benchmark data. Slight perturbations of the data can affect the generalization error by a few percent; nevertheless, the standard deviation of the generalization error is small enough for the results to be meaningful. Basically, it looks like the Haar DWT does a decent job at classification for these benchmarks, ranging in dimension from 2 to 60 and in number of classes from 2 to 26, though it is certainly too simple a method to challenge for best generalization. Since the Haar DWT learns a forced-split dyadic decision tree, one should expect its performance to be comparable to but slightly worse than a more general decision tree. This was confirmed in [8], and by our results on the vowel benchmark. By tweaking the data with principal component analysis, we observed one tree with 54% generalization error. For comparison, [25] tabulates the performance of 17 other methods. Only three of them beat nearest neighbor, which had 44% error (the best performer had 39%). The CART decision tree [25] had 56% error, and a linear perceptron tied for worst, at 67% error. The multiclass Haar DWT seems to do best with the two-spirals benchmark, presumably because it is only in a 2-D space. Using a compact depth-first tree-based storage method described in [37], the function learned by the DWT could probably be stored in about 30 bytes or less. It is important to note that the use of the standard "dyadic" splits by the Haar DWT can greatly reduce the number of nodes in the first hidden layer (compared to a threshold network learned from a decision tree with more freedom in placing decision planes). If W is the number of nodes in the second hidden layer, there might be up to roughly dW nodes in the first hidden layer, since each hyperrectangular region is delineated by 2d planes, and each plane is only "guaranteed" to be shared by (usually) two regions. But since the Haar DWT only allows the planes at certain offsets, there is a much higher probability that planes
will be shared by many hyperrectangular regions. For example, in the threshold network created from the letter benchmark, each hyperplane is evidently shared by a large fraction of the hyperrectangular regions on average. In the sonar benchmark, each hyperplane is used in about 88% of the regions.

1) Constructive Regularization for Converging an Approximation: From some extensive experiments with pruning via bagging within the Haar dyadic framework (a method outside the scope of this paper), no or very little improvement in accuracy over the unregularized multidimensional Haar classifier was observed unless the training set contained significant noise in the form of incorrectly classed data points. Similar results were seen by Scott and Nowak in [8]. This suggests that the noise level on these benchmarks is low enough that pruning yields unimpressive returns, and sometimes even throws away useful information. Therefore, instead of using pruning to regularize an approximation of an already-known class-valued function, a better method for the problem definition in Section I-D is to adaptively sample the true function, taking into account information about the approximation to fill in gaps by predicting where the approximated decision boundary is most likely to be farthest from the true decision boundary (preferably in an efficient way not directly affected by the dimensionality), and using the true value at those points to form an extended training set for a better-regularized approximation. However, specific methods for adaptive sampling are outside the scope of this paper.

B. Semi-Analytical Result

Though dyadic decision trees have been proven by Scott and Nowak [17] to be asymptotically near-optimal for converging to worst-case functions, their convergence seems less than adequate on more practical functions having smooth, locally linear decision boundaries, for which we suspect much faster convergence is possible (see Section I-D). We show a simple example illustrating the slow convergence of decision trees having hyperplane splits that are parallel to coordinate axes. We examined, for different linear boundaries, whether the integrated error volume can be reduced inversely proportionately as the number of wavelet coefficients grows. This appears possible in two and three dimensions, though boundaries offset from "center" seem to require considerably more coefficients than a dyadic decision tree with "free splits." In higher dimensions, convergence looks decidedly bad for linear function boundaries normal to the vector (1, 1, ..., 1). For example, consider a hypercubic domain with the decision boundary given by the hyperplane through its center normal to (1, 1, ..., 1). This is shown for two dimensions in Fig. 8. Suppose we would like to place just one hyperrectangular region to cover as much of the shaded region as possible without covering very much of the unshaded region. More precisely, half the original domain is misclassified, and we seek the hyperrectangle that reduces this error as much as possible. It appears the optimal hyperrectangle in this case is a hypercube with one corner vertex at a corner of the domain and the opposite vertex on the main diagonal. Using a couple of simple approaches and Matlab scripts, it appears that as the dimension grows, the
Fig. 8. Finding the optimal sized hypercube to reduce error of the shaded region. In two dimensions, error is reduced by 67%, but in 512 dimensions, the error is reduced by only about 4%.
error approaches 0.5 (not reduced at all by the cube). For example, with 512 dimensions, the optimum is approximately 0.996 and the error fraction is approximately 0.48; by contrast, in two dimensions, the optimum is approximately 0.333, with an error fraction of only about 0.166. Since axis-parallel splits are almost completely ineffective at reducing error in high dimensions for this problem, the Haar DWT and decision trees allowing only axis-parallel splits suffer from the curse of dimensionality in a rather unacceptable way. So it seems error can successfully "hide in the corners" in high dimensions, apparently due to the inflexibility caused by requiring the decision planes to be parallel to a coordinate axis. For a decision tree architecture, this suggests allowing low-precision hyperplane bisections normal to vectors whose entries are in {-1, 0, 1}. These seem like the simplest possible splits that will still allow the decision tree to scale to arbitrary dimensions (obviously, they handle the previous problem trivially for any dimension). However, allowing these types of splits will create odd-shaped, highly oblique regions of space that actually become difficult just to sample from. This type of architecture has been harshly criticized in [38]. Generation of uniform samples is related to the problem of volume computation, but exact volume computation of convex polytopes has been shown to be #P-hard [39]. However, a fairly new stochastic algorithm (developed in the 1990s) based on physical mixing [40] solves the simpler problem of approximating volume in polynomial time while generating uniform samples within only the convex region of interest. This sampling technique may have made better-converging decision trees feasible. For low-dimensional functions, though, the simplicity of the Haar DWT is an advantage and can compress data well, as evidenced by our results with the two-spirals benchmark.
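To reproduce the flavor of this experiment, here is a small Monte Carlo sketch under our reading of the setup (an assumption on our part): the class boundary is the central hyperplane normal to (1, 1, ..., 1) in the unit cube, and a single axis-aligned cube anchored at the origin approximates the shaded half.

import numpy as np

def residual_error_of_best_cube(d, sides, n_samples=50_000, seed=0):
    """Estimate the misclassified volume fraction when the half-space
    sum(x) <= d/2 in [0, 1]^d is approximated by a single cube [0, s]^d,
    minimized over the candidate side lengths in `sides`."""
    rng = np.random.default_rng(seed)
    x = rng.random((n_samples, d), dtype=np.float32)
    shaded = x.sum(axis=1) <= d / 2        # true class-1 half of the domain
    max_coord = x.max(axis=1)              # point lies in [0, s]^d iff max coord <= s
    errors = [(np.mean(shaded != (max_coord <= s)), s) for s in sides]
    return min(errors)

for d in (2, 512):
    err, s = residual_error_of_best_cube(d, np.linspace(0.5, 1.0, 201))
    print(f"d = {d:3d}: best single-cube side ~ {s:.3f}, residual error ~ {err:.3f}")

Under this parameterization, the residual error comes out near 0.17 in two dimensions but stays close to 0.48 in 512 dimensions, matching the 67% versus roughly 4% error reductions quoted in Fig. 8; the exact parameterization optimized in the authors' Matlab scripts may differ.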
Fig. 9. Two-class function with numbered splitting planes. Class 1 is shaded and class 0 is unshaded.
VI. HAAR DWT OF DECISION TREES

The multiclass Haar DWT can be generalized and extended to the calculation of the (class-valued) Haar coefficients of a decision tree, as illustrated in this section. Where before each Haar coefficient of a standard multidimensional array corresponded to a "dyadic hyperrectangle" (see Sections II-B and IV), the Haar coefficients of decision trees can correspond to more general hyperrectangles with sides parallel to coordinate axes for decision trees having hyperplane splits parallel to coordinate axes (referred to hereafter as "parallel-split DTs"), to convex polytopic regions of the domain for arbitrary hyperplane splits, or to conjunctive (convex) rules and nonstraddling exceptions (Haar-framework exceptions are defined in Section III-A) for even more general DTs. Though parallel-split DTs are simpler and more commonly used in practice, a more general linear-combination-split decision tree example is employed here to illustrate the ability of Haar coefficients to represent general convex polytopic regions of the domain. Given the two-class function in Fig. 9, a greedy top-down decision tree learner would output the tree in Fig. 10(a), whose nodes are numbered by the decision plane labels of Fig. 9. Left branches are taken for points on the upper/left side of the plane, and right branches are taken for points on the lower/right side. The multiclass Haar DWT is performed from the bottom up in the tree, where the high- and low-pass coefficients of each node are computed from the low-pass coefficients of the node's children using the candidate-set intersection/union rules from Section II-D. Phase-1 low-pass candidate sets are enclosed in curly braces for each node, while nonnull phase-2 high-pass coefficients are enclosed in square brackets over the branch to the child node or leaf they influence. The sticker-rule region covered by a high-pass coefficient is defined by the conjunction of conditions occurring above that point in the tree (a conjunction that is often narrowable, though not always easily for trees with nonaxis-parallel splits).

A. Rulesets and E-Trees

The E-tree is defined as a recursive tree of nonstraddling (defined in Section III-A) exception-rule implicants, where an implicant is considered an exception of its parent implicant, and an implicant can have multiple child-exception implicants.
Fig. 10. Rebalancing example. Two decision trees for the two-class function in Fig. 9. The first tree is unbalanced, formed by greedy splits, while the second tree is the rebalanced version. Nodes are labeled by the number of the decision plane. Left branches are taken for points on the upper/left side of the decision plane, while right branches are taken for points on the lower/right side.
These exception implicants are in one-to-one correspondence with the nontrivial high-pass Haar coefficients of a decision tree, and their parent-child relationships can be easily resolved with a depth-first traversal starting at the root of the (hierarchically structured) nontrivial Haar coefficients. The example from Figs. 9 and 10 has a top-level implicant, excepted by two child implicants (a rectangular and a small triangular region). Such a ruleset bears resemblance, particularly when derived from axis-parallel-split DTs, to the rulesets generated by the decision tree program C4.5 [3] or C5.0, as well as to the popular R-tree family [41]-[43] of multidimensional spatial access methods. Similarity with the R-tree inspires the name "exception tree," or E-tree. Compared to Fourier spectral storage, R-trees and E-trees have the advantage of spatial locality, so they can easily store any data (such as nonnumeric values and histograms) related to a region.

1) Comparison to Rule Generation in C4.5: The Haar DWT hierarchical rule generation seems to be a more elegant way to generate rules than is employed in C4.5. According to the summary of Chapter 5 of Quinlan's book [3], C4.5 basically generates a rule for each leaf of the tree, discards unnecessary conditions in the rules, then discards rules that do not help the accuracy of the set, orders the rules to maximize accuracy, and finally establishes a single default class when no rule applies. Aside from the default rule, then, the C4.5 rules do not seem to allow exceptions, which in some cases could necessitate a factor of O(d) more rules, as explained in Section III-A and Fig. 5.
2) Interpretation of Data: Perhaps the most practically significant use of E-trees and the Haar coefficients of decision trees is in interpreting classification functions. Decision trees are difficult to interpret, but a hierarchical tree of rules and exceptions, with histograms and other statistics at each node, would allow a person to begin to visualize the structure of a dataset and hypothesize about underlying factors, much as one would do with a 2-D scatterplot of the data. The E-tree could also enable faster selection of meaningful 2-D projections.

B. Rebalancing Decision Trees

Decision tree learning can be generalized from using (point, value) pairs as input to using (exception-implicant, value) pairs (high-pass Haar coefficients) as input. Since the exception implicants represent generalizations over regions rather than points of the domain, they help the greedy algorithm "look ahead" further, making it less greedy and leading to more balanced DTs. The rebalanced tree in Fig. 10(b) was obtained from the Haar coefficients of the tree in Fig. 10(a) as follows. There are two nontrivial Haar coefficients to rebalance with (a rectangular and a triangular region). These regions are easily found to be separated by either plane #4 or plane #7, and #4 is preferred since it also serves as a boundary of the region that requires more enclosing planes. The local parent-child rotations used by Murthy [6] to successfully rebalance trees would not be capable of this rebalancing unless they looked several levels down the tree for possible rotations.

The Haar analysis opens up many possibilities for rebalancing algorithms of widely ranging complexity. These algorithms might optimize different objectives: minimize the maximum depth, or minimize the average depth weighted by the probability density of the regions. A whole class of DTs shares the same E-tree, and a rebalancing algorithm would either try to find the member of that class that optimizes the objective, or possibly find reason to merge and/or divide rules to obtain a better decision tree. Rebalancing parallel-split DTs would be much easier because intersections of planes with regions are easy to calculate. A simple algorithm is to balance the decision tree at each node of the E-tree by recursively choosing decision planes that balance the number of exceptions on either side of the plane, since each exception implies a number of additional decision planes. More generally, an algorithm might choose the plane that balances the number of "useful planes" on either side, and the Haar coefficients would help to quickly decide which planes are useful in a given range.
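A minimal sketch of the simple exception-balancing heuristic is given below (our illustration, not the paper's algorithm). The geometric test side(region, plane), returning -1 or +1 for a region entirely on one side of the plane and 0 for a region the plane straddles, is assumed to be supplied; as noted above, it is easy to implement for parallel-split DTs.

```python
# Minimal sketch (illustrative only): choose the candidate split plane that best
# balances the exception implicants on either side, penalizing straddled regions.
# Assumes at least one candidate plane is supplied.
from typing import Callable, List, Tuple

def choose_split(planes: List[object],
                 exceptions: List[object],
                 side: Callable[[object, object], int]) -> Tuple[object, List[object], List[object]]:
    best = None
    for plane in planes:
        left = [e for e in exceptions if side(e, plane) <= 0]    # straddlers go to both sides
        right = [e for e in exceptions if side(e, plane) >= 0]
        straddled = sum(side(e, plane) == 0 for e in exceptions)
        imbalance = abs(len(left) - len(right)) + straddled
        if best is None or imbalance < best[0]:
            best = (imbalance, plane, left, right)
    _, plane, left, right = best
    return plane, left, right
```

Recursing on the returned left and right exception lists, with the remaining candidate planes, builds the rebalanced tree.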
VII. SUMMARY AND FUTURE WORK

The following list summarizes the potentially useful capabilities of the multiclass, multidimensional Haar DWT:
• converting a decision tree to a highly interpretable rule/exception tree called an E-tree, with histograms and other statistics at each node;
• standardized, spatially localized representation or approximation of a multidimensional function;
• fast rebalancing of decision trees, which could be used, for example, to customize paper-and-pencil tax return forms based on differing prior probabilities;
• parallelizing DTs into fixed-depth (unlimited fan-in/out) fault-tolerant threshold or Boolean networks;
• learning a dyadic decision tree directly from a training set of (point, value) pairs.
The multiclass multidimensional Haar DWT is very efficient, elegant, and simple. Its coefficients have very useful interpretations as regions of space or as conjunctions of conditions, allowing us to easily construct reasonably sized, fast-evaluating, fixed-depth Boolean or threshold networks, and even helping to balance general binary decision trees. Though we have seen that the performance of decision trees with axis-parallel splits (and of the Haar DWT) can fall off dramatically with increasing dimension, the efficiency and simplicity of the Haar DWT might still be a powerful combination for quickly learning accurate dyadic decision trees in lower dimensions, by compiling a bagged or boosted sum of dyadic trees into a single fast-evaluating tree. We think a decision tree architecture with low-precision splits (split hyperplanes whose normal vectors have entries restricted to a small set of values) may make storage complexity tractably scalable to functions of arbitrarily high dimension. Similar tree architectures have been rather harshly criticized [38] because of the difficulty of dealing with the irregular and oblique regions they create, but recent advances in uniform sampling for volume approximation [40] may have made this strategy ultimately feasible. The resulting decision tree could still be realized as a fixed-depth Boolean or threshold network with the help of the multiclass Haar DWT. Even for parallel-split DTs, the theme of low-precision splits seems to be a good idea for interpretability. An algorithm like that for minimizing incompletely specified FBDDs [13], discussed in Section I-B, could be applied recursively to choose the optimal coordinate to split on, a single binary bit at a time. The resulting E-tree should have nice round numbers in the conditions of each rule, making interpretation easier and storage cheaper.

ACKNOWLEDGMENT

The authors would like to thank the reviewers for their ideas and suggestions.

REFERENCES

[1] Y. Le Cun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[2] R. Mulvaney and D. S. Phatak, "Efficient realization of classification using modified Haar DWT," in Proc. Int. Joint Conf. Neural Networks, Portland, OR, Jul. 2003, pp. 1774–1779.
[3] J. R. Quinlan, C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann, 1993.
[4] I. K. Sethi, "Entropy nets: From decision trees to neural networks," Proc. IEEE, vol. 78, pp. 1605–1613, Oct. 1990.
[5] G. L. Foresti and C. Micheloni, "Generalized neural trees for pattern classification," IEEE Trans. Neural Netw., vol. 13, pp. 1540–1547, Nov. 2002.
[6] K. V. S. Murthy, "On Growing Better Decision Trees From Data," Ph.D. dissertation, Johns Hopkins Univ., Baltimore, MD, 1995.
[7] T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to Algorithms. Cambridge, MA: MIT Press, 1990.
[8] C. Scott and R. Nowak, "Dyadic classification trees via structural risk minimization," in Proc. Neural Information Processing Systems (NIPS), Vancouver, BC, Canada, Dec. 2002.
[9] T. Y. Young and T. W. Calvert, Classification, Estimation, and Pattern Recognition. New York: Elsevier, 1974.
[10] B. J. Falkowski and C.-H. Chang, "Forward and inverse transformations between Haar spectra and ordered binary decision diagrams of Boolean functions," IEEE Trans. Comput., vol. 46, pp. 1272–1279, Nov. 1997.
[11] B. J. Falkowski and C.-H. Chang, "Calculation of paired Haar spectra for systems of incompletely specified Boolean functions," in Proc. IEEE Int. Symp. Circuits and Systems (ISCAS), vol. 6, May 1998, pp. 171–174.
[12] B. J. Falkowski and C.-H. Chang, "Paired Haar spectra computation through operations on disjoint cubes," Inst. Elect. Eng. Proc. Circuits, Devices and Systems, vol. 146, pp. 117–123, Jun. 1999.
[13] C.-H. Chang and B. J. Falkowski, "Haar spectra based entropy approach to quasiminimization of FBDDs," Inst. Elect. Eng. Proc. Computers and Digital Techniques, vol. 146, no. 1, pp. 41–49, 1999.
[14] W. Gunther and R. Drechsler, "Minimization of free BDDs," in Proc. Asia and South Pacific Design Automation Conf. (ASP-DAC), vol. 1, Jan. 1999, pp. 323–326.
[15] T. Sasao and M. Fujita, Eds., Representations of Discrete Functions. Norwell, MA: Kluwer, 1996.
[16] R. S. Stankovic and J. T. Astola, Spectral Interpretation of Decision Diagrams. New York: Springer-Verlag, 2003.
[17] C. Scott and R. Nowak, "Near-minimax optimal classification with dyadic classification trees," in Proc. Neural Information Processing Systems (NIPS), Vancouver, BC, Canada, Dec. 2003.
[18] B. Park and H. Kargupta, "Constructing simpler decision trees from ensemble models using Fourier analysis," in Proc. 7th Workshop on Research Issues in Data Mining and Knowledge Discovery, ACM SIGMOD 2002, 2002, pp. 18–23.
[19] H. Kargupta and H. Park, "A Fourier spectrum-based approach to represent decision trees for mining data streams in mobile environments," IEEE Trans. Knowl. Data Eng., vol. 16, no. 2, pp. 216–229, 2004.
[20] R. Hecht-Nielsen, "Theory of the backpropagation neural network," in Proc. Int. Joint Conf. Neural Netw., vol. 1, Washington, DC, 1989, pp. 593–605.
[21] E. J. Stollnitz, T. D. DeRose, and D. H. Salesin, "Wavelets for computer graphics: A primer, part 1," IEEE Comput. Graph. Appl., vol. 15, pp. 76–84, May 1995.
[22] D. Debnath and T. Sasao, "A heuristic algorithm to design AND-OR-EXOR three-level networks," in Proc. Asia and South Pacific Design Automation Conf., 1998, pp. 69–74.
[23] T. Sasao, "OR-AND-OR three-level networks," in Representations of Discrete Functions, T. Sasao and M. Fujita, Eds. Norwell, MA: Kluwer Academic, 1996.
[24] C. Xiang, S. Q. Ding, and T. H. Lee, "Geometrical interpretation and architecture selection of MLP," IEEE Trans. Neural Netw., vol. 16, Jan. 2005.
[25] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer, 2001.
[26] T. Kugarajah and Q. Zhang, "Multidimensional wavelet frames," IEEE Trans. Neural Netw., vol. 6, pp. 1552–1556, Nov. 1995.
[27] M. G. Karpovsky, Finite Orthogonal Series in the Design of Digital Devices. New York: Wiley, 1976.
[28] C. Faloutsos and S. Roseman, "Fractals for secondary key retrieval," in Proc. 8th ACM SIGACT-SIGMOD-SIGART Symp. Principles of Database Systems (PODS), 1989, pp. 247–252.
[29] J. A. Orenstein and T. H. Merrett, "A class of data structures for associative searching," in Proc. 3rd SIGACT-SIGMOD Symp. Principles of Database Systems, 1984, pp. 181–190.
[30] R. Mulvaney and D. S. Phatak, Multiclass Multidimensional Haar DWT Supporting Information. [Online]. Available: http://www.cs.umbc.edu/~phatak/sw/mmh-dwt
[31] D. S. Phatak and I. Koren, "Complete and partial fault tolerance of feedforward neural nets," IEEE Trans. Neural Netw., vol. 6, pp. 446–456, Mar. 1995.
[32] D. S. Phatak, "Fault tolerant artificial neural networks," in Proc. 5th IEEE Dual Use Technologies and Applications Conf., Utica, NY, May 1995, pp. 193–198.
[33] D. S. Phatak and E. Tchernev, "Synthesis of fault tolerant neural networks," in Proc. Int. Joint Conf. Neural Networks, Honolulu, HI, May 2002.
[34] P. Chandra and Y. Singh, "Feedforward sigmoidal networks: Equicontinuity and fault-tolerance properties," IEEE Trans. Neural Netw., vol. 15, pp. 1350–1366, Nov. 2004.
[35] C. Blake and C. Merz, "UCI Repository of Machine Learning Databases," 1998. [Online]. Available: http://www.ics.uci.edu/~mlearn/MLRepository.html (accessed Jul. 2005).
[36] CMU AI Repository Neural Benchmarks. [Online]. Available: http://www-2.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/neural/bench/cmu/bench.tgz
[37] "Tech. Rep. TR-CS-03-21," Univ. Maryland, Baltimore County, MD, Jan. 2003.
[38] L. Breiman and J. H. Friedman, "Tree-structured classification via generalized discriminant analysis: Comment," J. Amer. Statist. Assoc., vol. 83, pp. 725–727, Sep. 1988.
[39] M. Dyer and A. M. Frieze, "The complexity of computing the volume of a polyhedron," SIAM J. Comput., vol. 17, pp. 967–974, 1988.
[40] R. Kannan, L. Lovasz, and M. Simonovits, "Random walks and an O*(n^5) volume algorithm for convex bodies," Random Structures and Algorithms, vol. 11, pp. 1–50, Aug. 1997.
[41] A. Guttman, "R-trees: A dynamic index structure for spatial searching," in Proc. ACM SIGMOD, 1984, pp. 47–57.
[42] N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger, "The R*-tree: An efficient and robust access method for points and rectangles," in Proc. SIGMOD Int. Conf. Management of Data, May 1990, pp. 322–331.
[43] I. Kamel and C. Faloutsos, "Hilbert R-tree: An improved R-tree using fractals," in Proc. 20th Int. Conf. Very Large Databases, Santiago, Chile, 1994, pp. 500–509.
Rory Mulvaney received the B.S. degrees in computer science (summa cum laude) and math (magna cum laude) from the University of Minnesota in 2000. He received the M.S. degree in computer science in 2002. As an undergraduate, he wrote a parallelized quantum computer simulator for a senior project, and after graduation, he continued research in quantum computation in graduate school at the University of Maryland, Baltimore County. He switched focus to neural networks and pattern recognition. He is a Ph.D. degree candidate, advancing methods for accurate function value estimation using prior samples, with application to gene expression data in clinical trials.
Dhananjay S. Phatak received the B.Tech. degree in electrical engineering from the Indian Institute of Technology, Bombay, in 1985, the M.S. degree in microwave engineering in 1990, and the Ph.D. degree in computer systems engineering in 1993, both from the Electrical and Computer Engineering Department, University of Massachusetts, Amherst. From 1994 to 2000, he was an Assistant Professor of Electrical Engineering at the State University of New York, Binghamton. Since Fall 2000, he has been an Associate Professor with the Computer Science and Electrical Engineering Department, University of Maryland, Baltimore County. His current research interests and activities span the areas of: 1) mobile and high performance computing and networks; 2) computer arithmetic algorithms and their VLSI implementations, and digital and analog VLSI design and CAD; and 3) neural networks. In the past, he has worked on microwave and optical integrated circuits. He has published articles in the IEEE TRANSACTIONS in several diverse areas (computers, neural networks, and microwave theory and techniques), as well as in other premier journals and conferences in his areas of research. His high-speed CORDIC algorithm has been awarded a patent. Dr. Phatak received the National Science Foundation CAREER Award in 1999. He recently completed a three-year term as an Associate Editor for the IEEE TRANSACTIONS ON COMPUTERS. He has been active on the technical program committees of premier conferences in his areas of research (including ACM MobiCom, IEEE INFOCOM, the International Symposium on Computer Arithmetic, and IJCNN). He has received research support from the NSF, Lockheed Martin, and AetherSystems Inc.