Number theoretic transforms in neural network image classification
T.P. Harte and R. Hanka, University of Cambridge, School of Clinical Medicine, Medical Informatics Unit, Robinson Way, Cambridge, CB2 2SR, England; telephone: +44 1223 330305; fax: +44 1223 330330; e-mail:
[email protected].
Abstract
Judicious use of number theory and abstract algebra can increase the efficiency of neural network pattern classifiers. We use the example of our neural network tissue classification of MRI of the breast to illustrate that efficient pattern recognition algorithms can be designed from algebraic fundamentals. The algorithm is based on number theoretic transform speed-up of an FFT-based convolution scheme.
1 Introduction

With the aid of statistical theory the neural network has transcended its `black box' image and enjoys widespread usage [2]. It is an inherently flexible pattern classifier and, being non-linear, is arguably the most general. Many applications are time-critical and, without optimization algorithms, the neural network may be unable to produce results speedily: for example, our neural network classifier of tissue in magnetic resonance images (MRI) [3]. We have presented an FFT-based speed-up algorithm, in which we observed that when an image mask forms the input to feed-forward neural networks, such as the multi-layer perceptron, the underlying operation is spatial convolution [4]. We achieve this multi-dimensional convolution via the FFT. The VLSI community has put much effort into ensuring that floating-point arithmetic, upon which the FFT relies, is nowadays blindingly fast compared to the hardware of yesteryear. However, integer multiplication is simpler than floating-point. For example, a floating-point multiply is an integer multiply and an integer add (the mantissas are multiplied, the exponents added), followed by a normalization step of at most two shifts, plus special handling for all the special cases (zero, Inf, NaN, etc.). Herein lies a philosophical quandary: should we build hardware that makes inherently complex arithmetic fast, or should we concentrate on algorithms better suited to hardware implementation? Modern designs have opted for the former. We present a number theoretic transform (NTT) implementation of our classifier and demonstrate the algorithmic gains that finite integer methods place at our disposal. Nussbaumer [7] provides an excellent exposition of the NTT, along with a sound treatment of the general case of convolutions performed by discrete orthogonal transforms. Background reading for the algebra can be found in Lipson [6]; Rosen's [9] presentation of the principles of number theory is quite lucid. We outline the theory and uses of NTTs here.
2 Neural Networks: mask processing is convolution

Our image data are 3-D, time-evolving MRI volumes of the breast. Malignant lesions enhance with uptake of the contrast medium; for classification purposes, therefore, we exploit individual pixels' time characteristics by examining a volume centred on a particular pixel. This is really a 3-D mask process. Note that the values of pixels contained within this volume also change in time as pixels range from $t_0$ to $t_3$. Our input data effectively form a 4-D hyper-cube, and we feed $L^2$-normalized pixel time differentials into a pre-trained back-propagation neural network for classification. We have shown that areas pinpointed by the neural network classifier match those ringed as suspicious by an experienced radiologist [3]. Feed-forward neural networks with two layers¹ of sigmoidal units having $N$ input variables, $M$ hidden-layer nodes and $L$ output nodes model functions on $\mathbb{R}^N$ of the form

$$z_l(\mathbf{x}) = \sigma^{(2)}\!\left( \sum_{m=0}^{M-1} w^{(2)}_{lm}\, \sigma^{(1)}\!\left( \sum_{n=0}^{N-1} w^{(1)}_{mn} x_n + b^{(1)}_m \right) + b^{(2)}_l \right). \qquad (1)$$

¹ We adopt the convention advocated by Bishop [2]: `the first layer' denotes the first hidden layer.
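As a concrete reading of Eq. (1), the sketch below (Python/NumPy assumed, not the authors' code; the sizes and random weights are illustrative only) evaluates the two sigmoidal layers directly:

```python
# A minimal sketch of the two-layer model in Eq. (1): N input variables,
# M hidden-layer nodes, L output nodes.
import numpy as np

def sigmoid(z):
    """A bounded `saturation' function: -> 0 as z -> -inf, -> 1 as z -> +inf."""
    return 1.0 / (1.0 + np.exp(-z))

def z_out(x, W1, b1, W2, b2):
    """z_l(x) = sigma2( sum_m w2_lm * sigma1( sum_n w1_mn x_n + b1_m ) + b2_l )."""
    y = sigmoid(W1 @ x + b1)        # first (hidden) sigmoidal layer
    return sigmoid(W2 @ y + b2)     # second (output) sigmoidal layer

N, M, L = 16 * 16 * 4 * 4, 8, 2     # a flattened 16x16x4x4 mask, 8 hidden nodes
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((M, N)), np.zeros(M)
W2, b2 = rng.standard_normal((L, M)), np.zeros(L)
print(z_out(rng.standard_normal(N), W1, b1, W2, b2))
```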
The sigmoidal functions $\sigma^{(i)}$ are bounded `saturation' functions on the real line for which $\sigma^{(i)}(z) \to 0$ as $z \to -\infty$ and $\sigma^{(i)}(z) \to 1$ as $z \to +\infty$. We can express each sigmoidal layer as:

$$c_m = \sum_{n=0}^{N-1} w^{(1)}_{mn} x_n + b^{(1)}_m, \qquad y_m = \sigma^{(1)}(c_m), \qquad (2)$$

$$d_l = \sum_{m=0}^{M-1} w^{(2)}_{lm} y_m + b^{(2)}_l, \qquad z_l = \sigma^{(2)}(d_l), \qquad (3)$$
where the $w_{mn} x_n$ and the $w_{lm} y_m$ compute the inner product of vectors in $\mathbb{R}^N$ and $\mathbb{R}^M$, respectively. For a particular sigmoidal layer $i$, the vectors $(w^{(i)}_{j,0}, \ldots, w^{(i)}_{j,N-1})^T$ are the weighted connections from the outputs of the preceding layer, $(x^{(i-1)}_0, \ldots, x^{(i-1)}_{N-1})^T$, to the $j$th node of this layer. The $b^{(i)}_j$ are biases for the $j$th node of sigmoidal layer $i$. Using our image mask, suppose we attempt to classify a complete image set with a neural network which has eight nodes in the first sigmoidal layer. Then, in order to compute $c_m$ in Eq. (2), a total of

$$\underbrace{256 \times 256 \times \overbrace{28}^{\text{depth}} \times \overbrace{4}^{\text{time}}}_{\text{image}} \times \underbrace{16 \times 16 \times 4 \times 4}_{\text{mask}} \times \underbrace{8}_{\text{nodes}} \approx 2.4 \times 10^{11} \qquad (4)$$

multiplications and additions must be carried out at the input stage of the neural network alone. It would seem that mask processing is not computationally feasible, considering the time constraints required for a clinical application. We achieve significant computational economy by showing that mask processing in this context is spatial convolution [4]. Observe that the inner product in Eq. (2) is conventionally written in 1-D array terms [2, Page 118] but, for mask processing, a hyper-cube forms the input to the first hidden layer. Thus, we write the $x$'s and $w$'s as functions of the hyper-cubic image indices, e.g. $x(n_0, n_1, n_2, n_3)$. As the mask slides along the image, this describes a multi-dimensional convolution of the mask with the image set. We think in terms of the invariant weights $w_m(n_0, n_1, n_2, n_3)$ for each node sliding along the image, instead of a hyper-cube centred on a pixel-of-interest forming a one-dimensional input array $\mathbf{x}$. To clarify, picture our data in two array dimensions only, that is, as $N_0 \times N_1$ square images. The input to the $m$th node of the first sigmoidal layer in Eq. (2) becomes:

$$c_m = \sum_{n=0}^{N-1} w^{(1)}_{mn} x_n + b^{(1)}_m \;\Longrightarrow\; c_m(\alpha, \beta) = \sum_{n_0=0}^{N_0-1} \sum_{n_1=0}^{N_1-1} w^{(1)}_m(n_0, n_1)\, x(\alpha - n_0, \beta - n_1) + b^{(1)}_m, \qquad (5)$$

for an $N_0 \times N_1$ image. With the convolution theorem [8, Page 110] we could then write:

$$w^{(1)}_m(\alpha, \beta) \circledast x(\alpha, \beta) \;\Longleftrightarrow\; W^{(1)}_m(u, v)\, X(u, v) \qquad (6)$$
in place of the sum terms, where $W^{(1)}_m(u, v)$ and $X(u, v)$ are the discrete Fourier transforms of $w^{(1)}_m(\alpha, \beta)$ and $x(\alpha, \beta)$ respectively, defined by the forward (DFT) and inverse (IDFT) transforms:

$$\mathrm{DFT}[x] \triangleq X(u, v) = \sum_{n_0=0}^{N_0-1} \sum_{n_1=0}^{N_1-1} x(n_0, n_1)\, \omega_{N_0}^{n_0 u}\, \omega_{N_1}^{n_1 v}, \qquad u = 0, 1, \ldots, N_0 - 1,\; v = 0, 1, \ldots, N_1 - 1, \qquad (7)$$

$$\mathrm{IDFT}[X] \triangleq x(n_0, n_1) = \frac{1}{N_0 N_1} \sum_{u=0}^{N_0-1} \sum_{v=0}^{N_1-1} X(u, v)\, \omega_{N_0}^{-u n_0}\, \omega_{N_1}^{-v n_1}, \qquad n_0 = 0, 1, \ldots, N_0 - 1,\; n_1 = 0, 1, \ldots, N_1 - 1, \qquad (8)$$

where $\omega_N = e^{-i(2\pi/N)} = \cos(2\pi/N) - i \sin(2\pi/N)$ and $i^2 = -1$. Thus, we can state Eq. (5) as:

$$c_m(\alpha, \beta) = \mathrm{IDFT}\{W^{(1)}_m(u, v)\, X(u, v)\} + b^{(1)}_m, \qquad (9)$$
where IDFT indicates the inverse transform. At each of the $M$ nodes of the first sigmoidal layer we must compute a new transform $W^{(1)}_m(u, v)$. The image transform $X(u, v)$ obviously remains unchanged at each of these nodes. It is straightforward to extend this procedure to higher array dimensions, and we thus surmount the computational bottleneck caused by mask processing. Because the exponential transform kernel $\omega_N$ is separable, we compute multi-dimensional FFTs using multiple 1-D FFTs across all the indices. Thus, in the 2-D illustration above, we would first obtain 1-D FFTs of the rows and then 1-D FFTs of the columns [8, Page 117].
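As an illustration of Eq. (9), the following sketch (Python/NumPy assumed, not the authors' code; the sizes are illustrative) computes one node's input plane $c_m(\alpha, \beta)$ by transforming the image once, transforming the node's zero-padded mask, multiplying, and inverse transforming. Note that FFT-based convolution is cyclic:

```python
import numpy as np

N0 = N1 = 256
rng = np.random.default_rng(1)
x = rng.standard_normal((N0, N1))            # one image plane
w = np.zeros((N0, N1))
w[:16, :16] = rng.standard_normal((16, 16))  # a 16x16 mask, zero-padded
b = 0.1                                      # bias b_m

X = np.fft.fft2(x)                           # Eq. (7): 1-D FFTs of rows, then columns
W = np.fft.fft2(w)                           # computed once per node m
c = np.fft.ifft2(W * X).real + b             # Eq. (9)

# Spot-check one output point against the direct sum in Eq. (5):
alpha, beta = 40, 50
direct = sum(w[n0, n1] * x[(alpha - n0) % N0, (beta - n1) % N1]
             for n0 in range(16) for n1 in range(16)) + b
assert np.isclose(c[alpha, beta], direct)
```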
3 Indirect convolution using NTTs

For long sequence lengths, such as those found in image processing, efficient implementation of finite convolution is achieved via the indirect DFT method, using the fast Fourier transform algorithm (FFT), cf. Eq. (9). The NTT overcomes the two principal drawbacks of the FFT. First, all FFT multiplications are complex and require a minimum of three floating-point multiplications each. Second, because digital computation is possible within a finite word-length only, the FFT must approximate the powers of $e^{-i(2\pi/N)}$ situated on the complex unit circle in Eq. (7) and Eq. (8). These are (mostly) irrational and so round-off error is introduced. The NTT represents real data precisely: only integers are multiplied. As NTTs are exact, they are useful in convolutions, multiple-precision calculations, and high-order polynomial and large-integer multiplication. Our strategy is: to take advantage of the simplicity of integer arithmetic for hardware and software, we perform integer-only indirect convolutions; and we base these algorithms on the FFT to profit from its highly efficient matrix decomposition.
NTTs have their arithmetic operations defined on finite integer rings, or on finite integer fields², which are algebraic structures different from $\mathbb{C}$, the (infinite) complex field, over which the DFT is defined. Modulo some integer $p$ we can define addition, subtraction and multiplication. The resulting structure is an integer ring (or field, if $p$ is prime). The integers $\mathbb{Z}$ are each mapped into one of $p$ residue classes according to the division property: for $a, b, p, r \in \mathbb{Z}$, $a = bp + r$, $0 \le r < |p|$; $b$ is termed the quotient and $r$ the residue (remainder). If $r = 0$ then we say that ``$p$ divides $a$,'' which we denote by $p \mid a$. Otherwise, in number theoretic terms, we write $a \equiv r \bmod p$, and we say ``$a$ is congruent to $r$ modulo $p$,'' the quotient $b$ being discarded. Clearly, then, if $a \equiv 0 \bmod p$ we know that $p \mid a$ exactly. Similar to the FFT, NTTs have separable kernels and so we can achieve multi-dimensional transforms by performing 1-D transforms over all the multi-dimensional indices. We thus concentrate on the 1-D NTT for clarity here. We denote the ring by $\mathbb{Z}_p$ and we define NTTs as FFTs mod $p$. Thus:

$$\mathrm{NTT}[x] \triangleq X(k) = \left\langle \sum_{n=0}^{N-1} x(n)\, r^{nk} \right\rangle_p, \qquad k = 0, 1, \ldots, N - 1, \qquad (10)$$

$$\mathrm{INTT}[X] \triangleq x(n) = \left\langle N^{-1} \sum_{k=0}^{N-1} X(k)\, r^{-kn} \right\rangle_p, \qquad n = 0, 1, \ldots, N - 1, \qquad (11)$$
where the $\langle \cdot \rangle_p$ notation indicates residue extraction mod $p$, which is the number theoretic term for finding the remainder. A number $r$ is called an $N$th root of unity if $r^N = 1$; it is a primitive root of unity if $r^k \ne 1$ for all $k$, $1 \le k < N$. The order of an element $r$ is the smallest exponent $d$ for which $r^d = 1$. For a primitive $N$th root, then, $r^{N+k} = r^k$. It is precisely the cyclic, or modular, nature of primitive roots that forms the basis for the matrix factorizations which render the FFT so efficient. We have, therefore, extended our FFT algorithms in [4] to the NTT. If the $N$th root of unity is two (non-multiplier transforms), the transform consists of addition, subtraction, multiplication implemented as fast binary shifts, and residue reduction:

$$X(k) = \left\langle \sum_{n=0}^{N-1} x(n)\, 2^{nk} \right\rangle_p, \qquad k = 0, 1, \ldots, N - 1. \qquad (12)$$
As an example, let $N = 8$, and let the input sequence be $x(n) = \{0, 3, 2, 1, 1, 1, 0, 0\}$. In $\mathbb{Z}_{17}$ we note: $2^8 \equiv 1 \bmod 17$. Eq. (10) computes the sequence $X(k) = \{8, 2, 11, 9, 15, 12, 4, 7\}$. The inverse requires normalization by $N^{-1}$; the inverse of two is nine: $2 \cdot 9 \equiv 1 \bmod 17$. We return to $x(n)$ by computing the inverse $x(n) = \langle \sum_k X(k)\, 9^{nk} \rangle_{17}$, which is $x'(n) = \{0, 7, 16, 8, 8, 8, 0, 0\}$. For $N = 8$, we have $8 \cdot 15 \equiv 1 \bmod 17$: $N^{-1} = 15$. When we multiply the $x'(n)$ by $N^{-1} \equiv 15 \bmod 17$, we arrive at $x(n) = \{0, 3, 2, 1, 1, 1, 0, 0\}$ once again³. Convolution is now straightforward: if we convolve, say, $x(n) = \{1, 2, 2, 0, 0, 0, 0, 0\}$ with itself, the transform domain sequence $X(k)$ is $\{5, 13, 7, 9, 1, 5, 8, 11\}$. Squaring each term yields $Y(k) = \{8, 16, 15, 13, 1, 8, 13, 2\}$; inverse transforming gives $\{8, 15, 13, 13, 15, 0, 0, 0\}$, followed by the normalized result: $\{1, 4, 8, 8, 4, 0, 0, 0\}$.

² A finite group is a set for which a binary operation (usually denoted by $\circ$ or juxtaposition) is defined; which satisfies closure; and has an identity element and an inverse for every element of the set. Rings are group extensions which have two binary operations, which we call addition and multiplication. The same rules apply as for groups, with the exceptions that the multiplicative and additive identities cannot be the same, and multiplicative inverses need not exist; if they do, the ring is a field [6].
³ Had we multiplied the sequence $X(k)$ by $N^{-1}$ to get $X'(k) = \{1, 13, 12, 16, 4, 10, 9, 3\}$, the inverse of this sequence would give us $\{x(n)\}$ directly.
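The worked example above can be checked directly in code. This sketch (plain Python, not the authors' implementation) is the naive $O(N^2)$ form of Eqs. (10)-(11); in practice the FFT-style matrix decomposition discussed above would be used:

```python
def ntt(x, r, p):
    """Forward NTT, Eq. (10): X(k) = < sum_n x(n) r^{nk} >_p."""
    N = len(x)
    return [sum(x[n] * pow(r, n * k, p) for n in range(N)) % p
            for k in range(N)]

def intt(X, r, p):
    """Inverse NTT, Eq. (11). For prime p, a^{-1} = a^{p-2} mod p."""
    N = len(X)
    r_inv = pow(r, p - 2, p)       # 2^{-1} = 9 in Z_17
    N_inv = pow(N, p - 2, p)       # 8^{-1} = 15 in Z_17
    return [(N_inv * sum(X[k] * pow(r_inv, n * k, p) for k in range(N))) % p
            for n in range(N)]

p, r = 17, 2                       # 2^8 = 1 mod 17, so length N = 8 works
x = [0, 3, 2, 1, 1, 1, 0, 0]
X = ntt(x, r, p)
assert X == [8, 2, 11, 9, 15, 12, 4, 7]
assert intt(X, r, p) == x

# Cyclic convolution of {1, 2, 2, 0, ...} with itself:
Y = ntt([1, 2, 2, 0, 0, 0, 0, 0], r, p)     # {5, 13, 7, 9, 1, 5, 8, 11}
Y2 = [(v * v) % p for v in Y]               # {8, 16, 15, 13, 1, 8, 13, 2}
assert intt(Y2, r, p) == [1, 4, 8, 8, 4, 0, 0, 0]
```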
4 Implementation

We require the following transform conditions for a suitable NTT: (i) $N$ is highly composite, e.g., $N = 2^\nu$, for FFT matrix decomposition; (ii) the binary expression of the modulus $M$ should be simple, for modular arithmetic; (iii) $r$ should be simple: 2, for example. We relax the latter constraint as all integer arithmetic is generally fast nowadays. We need a sufficiently long modulus to accommodate our sequence values. In $\mathbb{Z}_p$ we can unambiguously represent the convolution sequences $y(n) = x(n) \circledast h(n)$ if they are scaled so that $|y(n)|$ does not exceed $p/2$. As we know $\{h(n)\}$ a priori, we can bound the peak output magnitude by $|y(n)| \le |x(n)|_{\max} \sum_{n=0}^{N-1} |h(n)|$. Real data are accommodated by a post-convolution scaling factor $10^{-q}$, where $q$ is the desired precision. There are few prime moduli which we can work with: the requisite word-lengths are either too long or do not support a suitable convolution length. Our images at $256 \times 256 \times \text{depth} \times \text{time}$ each require multiple 1-D transforms of maximum length $N = 256$, which is particularly long compared to the short lengths previously imposed upon NTTs by modulus and root constraints [1]. Instead, we use a composite modulus. The prime factorization of any composite is $M = \prod_{i=1}^{l} p_i^{\nu_i}$. Notionally at least, we perform multiple NTTs modulo the individual $p_i^{\nu_i}$, as suggested by Agarwal & Burrus [1]. We find a suitable root $r$ using the Chinese Remainder Theorem [9, Page 127], such that

$$r^N \equiv 1 \pmod{p_i^{\nu_i}}, \qquad i = 1, 2, \ldots, l. \qquad (13)$$

To possess an inverse, an element of a ring or field must be relatively prime to the modulus, viz., $\gcd(a, M) = 1$, which we write $a \perp M$. Euler's totient function enumerates the relative primes: e.g., $\varphi(8) = 4$, counting $\{1, 3, 5, 7\}$. It is obvious that for $p$ a prime, $\varphi(p) = p - 1$; for $p$ composite, $\varphi(p) < p - 1$. If the modulus is a prime, then $\mathbb{Z}_p$ is a field and all the elements of $\mathbb{Z}_p^*$ are invertible⁴. Maximal-order (primitive) roots $r$ `generate' the invertible elements of $\mathbb{Z}_p$ in some order according to their powers $r^\alpha$, $1 \le \alpha < p$. For $M$ composite we require that $N \mid \gcd\{(p_1 - 1), (p_2 - 1), \ldots, (p_l - 1)\}$ [1]. As Lipson states [6, Page 304], we want $2^\nu \mid (p_i - 1)$, viz. primes of the general form $p = 2^m k + 1$, for odd $k$. There are plenty of primes which allow adequate precision, such as $p = 3 \cdot 5 \cdot 2^{27} + 1$ (31 bits) or $p = 27 \cdot 2^{59} + 1$ (64 bits), and primes of this form are not exceptional [6, Pages 304-305]. To satisfy condition (i), however, we need to work with primes which support lengths of $2^8$. Borrowing an idea from Kastrup's FIR implementation [5], we convolve our multi-dimensional sequences modulo a Mersenne number⁵ to satisfy condition (ii); in theory, however, we carry out the NTT modulo the Fermat primes⁶ of the composite's prime factorization. Thus, we use the modulus $M = 2^{32} - 1 = 255 \cdot 257 \cdot 65537 = 4294967295$, which is $255 \cdot (2^8 + 1) \cdot (2^{16} + 1)$, or $255 \cdot F_3 \cdot F_4$ exactly, and, of course, $2^8 \mid (F_3 - 1)$ and $2^8 \mid (F_4 - 1)$. The convolution results are only truly valid mod $257 \cdot 65537$, and so we simply discard the effects of the 255 in our final interpretation (reduction). To clarify, consider a decimal number: to check whether the number is even, i.e., congruent to 0 mod 2, we reduce the final digit mod 2: $\langle d_0 \rangle_2$. We simply discard the mod-5 results, even though our number is in decimal, i.e. mod 10, that is, mod $2 \cdot 5$. But we reap the benefits of efficient arithmetic performed mod $2^{32} - 1$: e.g., when we multiply two transforms together, $Y_k = X_k H_k$, the products $Y_k$ may have up to 64 bits each (requiring a 64-bit word). If $a$ is the sub-word formed from the 32 least significant bits of $Y_k$ and $b$ is the sub-word formed from the remaining 32 bits, then $Y_k \equiv a + b \bmod M$, which entails adding them and adding in any resulting carry flag also. Otherwise, multiplying two numbers modulo $257 \cdot 65537$ would involve taking the 64-bit product and then performing a 64-by-32-bit division by $257 \cdot 65537$, and divisions are very much slower than additions on most processors. To perform the complete convolution for each 1-D subset of an image we NTT-transform the image sequence and the mask sequence, obtain the products (as explained), inverse transform, and perform a final reduction modulo $257 \cdot 65537$. The latter introduces the only significant inefficiency, as most processors obtain residues via a trial division, which is slow. However, this is performed just once for each output point.

⁴ $\mathbb{Z}_p^*$ is $\mathbb{Z}_p$ excluding the 0 element.
⁵ Mersenne numbers exist according to the general formula $M_m = 2^m - 1$, where $m$ is any positive integer.
⁶ Fermat numbers are $F_t = 2^{2^t} + 1$, $t = 0, 1, 2, \ldots$, the first five of which are prime.
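The carry-folding reduction just described is simple to express in code. Below is a minimal sketch (plain Python, not the authors' implementation; Python integers are arbitrary-precision, so this only illustrates the word-splitting identity $2^{32} \equiv 1 \bmod M$ whose pay-off comes with fixed-width registers):

```python
M = (1 << 32) - 1                  # 2^32 - 1 = 255 * 257 * 65537

def mul_mod_mersenne(x, h):
    """Multiply mod 2^32 - 1 by folding the 64-bit product; no division."""
    y = x * h                      # up to 64 bits (requiring a 64-bit word)
    a = y & M                      # sub-word: 32 least significant bits
    b = y >> 32                    # sub-word: remaining high bits
    s = a + b                      # y = b*2^32 + a, and 2^32 = 1 (mod M)
    if s >= M:                     # fold the single possible carry
        s -= M
    return s

# Check against a direct division-based reduction:
assert mul_mod_mersenne(0xDEADBEEF, 0xCAFEBABE) == (0xDEADBEEF * 0xCAFEBABE) % M
```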
5 Discussion

NTTs are useful for the convolution of sequences, for the exact multiplication of large integers, etc. Their application to multi-dimensional signal processing is computationally economical from the software point of view; they offer especial advantages to the hardware designer. Software simulations of NTTs which use integer-only arithmetic significantly improve the efficiency of high-volume mask-processing computations in neural networks. The overall effect is a substantial (though machine-dependent) reduction in execution time achieved by fundamental algebraic reasoning alone. Further software optimization is possible when data-flow considerations, such as hierarchical memory access, are accounted for.
References
[1] R.C. Agarwal and C.S. Burrus. Fast convolution using Fermat number transforms with applications to digital filtering. IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-22, No. 2, April 1974, pp. 87-97.
[2] C.M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, Clarendon Press, Oxford, 1995.
[3] R. Hanka, T.P. Harte, A.K. Dixon, D.J. Lomas, and P.D. Britton. Neural networks in the interpretation of contrast-enhanced magnetic resonance images of the breast. In Current Perspectives in Healthcare Computing Conference, BHJC Books, Harrogate, UK, March 1996, pp. 275-283.
[4] R. Hanka and T.P. Harte. Curse of dimensionality: classifying large multi-dimensional images with neural networks. In Computer-Intensive Methods in Control and Signal Processing, K. Warwick and M. Kárný (Eds.), Birkhäuser Boston, New York, 1997, pp. 249-260.
[5] D. Kastrup. Schnelle Faltungsalgorithmen mittels Restklassenarithmetik [Fast convolution algorithms using residue-class arithmetic]. Cand. Ing. Thesis, Institut für Nachrichtengeräte und Datenverarbeitung, Rheinisch-Westfälische Technische Hochschule Aachen, 1994. Also, private communications with the author.
[6] J.D. Lipson. Elements of Algebra and Algebraic Computing. Addison-Wesley, Redwood City, California, 1981.
[7] H.J. Nussbaumer. Fast Fourier Transform and Convolution Algorithms. Springer-Verlag, New York, 1981.
[8] A.V. Oppenheim and R.W. Schafer. Digital Signal Processing. Prentice-Hall International, London, 1975.
[9] K.H. Rosen. Elementary Number Theory and Its Applications. Addison-Wesley, Reading, MA, USA, 1988.