IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 50, NO. 7, JULY 2004
Problems on Sequences: Information Theory and Computer Science Interface

I. INTRODUCTION
Recent years have seen a proliferation of research in "Problems on Sequences" which has benefited from the interplay between information theory and computer science: each of these two fields has had an impact upon the other in providing design paradigms and in providing ways of obtaining performance bounds. Because of the continued expansion of this research interest, it was deemed to be an opportune time for this special issue, devoted to this interface. The papers in this special issue well illustrate this theme. We have organized our discussion of these papers into the following subareas:

• Analysis of Algorithms in Problems on Sequences
• Data Structures in Problems on Sequences
• Complexity Issues
• Estimation/Prediction in Problems on Sequences

Some of the contributed papers exhibit more than one of these four subareas. Our classification was based on what we felt was the principal emphasis of each paper.
II. ANALYSIS OF ALGORITHMS IN PROBLEMS ON SEQUENCES

Analysis of algorithms, a field established by D. Knuth, consists of powerful techniques for analyzing the asymptotic performance of algorithms on sequences. Complex analysis is a prime tool in the analysis-of-algorithms "toolbox." Two papers in this special issue, by Jacquet and Szpankowski [20] and Conrad and Mitzenmacher [7], are prime illustrations of the use of analysis of algorithms to attack problems on sequences.

A. Asymptotics of Minimax Redundancy

In the 1970s, it was established [41], [37] that the worst case block minimax redundancy $R_n$ of an individual sequence of length $n$ with respect to the family of $r$th-order Markov information sources on a finite alphabet of size $k$ can be expanded to first order as

$$R_n \sim \frac{k^r(k-1)}{2}\,\log_2 n. \qquad (1)$$

Jacquet and Szpankowski [20] have now settled the problem of obtaining the second-order asymptotics in (1). They determine an explicit constant $A_{k,r}$, depending on $k$ and $r$, such that the minimax redundancy just discussed can be expanded more sharply as

$$R_n = \frac{k^r(k-1)}{2}\,\log_2 n + \log_2 A_{k,r} + o(1). \qquad (2)$$

The techniques they employ are so powerful that it is believable that, with a little more work, the third-order term in (2) can also be elucidated. Their approach involves getting exact and asymptotic formulas for the number of individual sequences belonging to a Markov type class; they do this in a clever way by reducing the problem to the more manageable problem of counting Eulerian paths in a multigraph.
B. Power Laws in Language Models

Let there be given a memoryless source with finite alphabet $A$ of size $k$, and let $A^+$ be the set of all nonempty finite strings over $A$. For each string $x \in A^+$, let $P(x)$ be the probability with which the source generates $x$. (To avoid trivialities, we suppose that $P(a) > 0$ for all $a \in A$.) Let $x_1, x_2, x_3, \ldots$ be a ranking of the elements of $A^+$ in which

$$P(x_1) \ge P(x_2) \ge P(x_3) \ge \cdots.$$

Using complex analysis tools from the analysis of algorithms, Conrad and Mitzenmacher [7] establish a power law for the sequence $\{P(x_i)\}$: that is, they find positive constants $c_1$, $c_2$, $\alpha$ for which

$$c_1 i^{-\alpha} \le P(x_i) \le c_2 i^{-\alpha}$$

for all $i$ sufficiently large. In [7], this power law result is offered in the context of an artificial language "monkey typing" model introduced by Mandelbrot as a counterpoint to natural language models. We point out here another interpretation of this problem in the context of Tunstall codes for the given source. For each positive integer $n$, the Tunstall algorithm [43] builds a rooted tree with $n(k-1)+1$ leaf vertices, $n$ nonleaf vertices, and $k$ outgoing edges per nonleaf vertex, such that at each nonroot vertex there is a unique string belonging to $A^+$ that is represented at that vertex by following the path from the root to that vertex and writing down a label from $A$ on each edge along the way. This rooted tree is used to build the variable-length to fixed-length Tunstall code for the given source, whose binary codewords correspond to the leaves of the tree; this is the lossless variable-length to fixed-length code having this many codewords that yields minimal expected compression rate. The Conrad–Mitzenmacher result potentially has implications regarding the asymptotic behavior of Tunstall code performance.
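To make the Tunstall construction concrete, here is a minimal sketch (our illustration, not code from [43] or [7]) that grows the parse tree by repeatedly splitting its most probable leaf; the example source and the number of splits are arbitrary choices.

import heapq

def tunstall_parse_set(probs, num_splits):
    # probs: dict mapping each source letter to its probability.
    # num_splits: number of nonleaf vertices created; the final tree
    # has num_splits*(k-1)+1 leaves, where k = len(probs).
    # Returns the sorted list of leaf strings (a complete prefix set).
    heap = [(-p, a) for a, p in probs.items()]  # max-heap via negated probs;
    heapq.heapify(heap)                         # the root split gives k leaves
    for _ in range(num_splits - 1):
        neg_p, w = heapq.heappop(heap)          # most probable leaf
        for a, p in probs.items():              # split it into k children
            heapq.heappush(heap, (neg_p * p, w + a))
    return sorted(w for _, w in heap)

# Binary source with P(a) = 0.7, P(b) = 0.3 and 4 splits:
# 4*(2-1)+1 = 5 parse strings.
print(tunstall_parse_set({"a": 0.7, "b": 0.3}, 4))

Ranking the leaf probabilities of larger and larger Tunstall trees built this way is one simple experimental probe of the power-law behavior established in [7].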
III. DATA STRUCTURES IN PROBLEMS ON SEQUENCES

Papers [18], [25], [40], [30], [35] illustrate the use of data structures such as grammars, graphs, trees, and patterns in connection with problems on sequences.

A. Grammar-Based Coding

Grammars are the key data structures employed in the correspondence paper by He and Yang [18] and the paper by Kieffer and Yang [25] appearing in this issue. Specifically, these papers concern grammar-based lossless source coding, in which the data sequence to be compressed is represented as a context-free grammar whose language consists uniquely of that data sequence.

We first discuss the He–Yang correspondence paper [18]. This paper reconsiders the irreducible grammar-based lossless source codes introduced in [24] from a new point of view. We put forth a general framework useful for understanding the contribution of this paper. Let $A$ be a finite source alphabet, and let $S$ be a complete prefix set of nonempty finite strings over $A$. (Prefix means no word in $S$ is a prefix of any other word in $S$, and complete means every infinite sequence from $A$ has a prefix in $S$.) Then, each sufficiently long string $x$ over $A$ has a unique parsing $x = x_1 x_2 \cdots x_t$ in which each $x_i$ for $i < t$ belongs to $S$ and $x_t$ has no proper prefix in $S$. One can attempt to define a $k$th-order empirical entropy $H_k(x \mid S)$ of $x$ relative to $S$ as the minimum of the negative logarithm of the probability assigned to the parsed phrase sequence $x_1, x_2, \ldots, x_t$, where, for some finite set $U$ with $k$ elements, the following statements are true.

• The minimum is over all $p$, and the joint summation defining the assigned probability is over all state sequences $(u_1, \ldots, u_t)$ in which each $u_i$ belongs to $U$.
• $p$ represents a conditional probability function $p(s \mid u)$ defined for $s$ in $S$ and $u$ in $U$ (with $p(w \mid u)$ denoting the summation of $p(s \mid u)$ over all $s \in S$ such that $w$ is a prefix of $s$).

There is something problematic with the definition of $H_k(x \mid S)$: the conditional probability functions that are allowed must be restricted appropriately. For one thing, as the length of $x$ goes to infinity, it is desired that the number of phrases $t$ also goes to infinity. Also, the effect of the last phrase $x_t$ on $H_k(x \mid S)$ should not be too great. (The precise manner in which the functions must be restricted will be clear to the reader from a reading of [18].) Having appropriately defined $H_k(x \mid S)$, one can define the $k$th-order maximal redundancy of grammar-based codes relative to this empirical entropy as

$$\max_{L}\, \max_{x}\, \frac{1}{|x|}\left[\, L(x) - H_k(x \mid S) \,\right]$$

where the outer maximum is over codeword length functions $L$ of the irreducible grammar-based codes that were introduced in the paper [24]. He and Yang prove that this maximal redundancy vanishes as the sequence length grows, for all positive integers $k$, where $S$ is the prefix set associated with the run-length parsing of strings over $A$. This result is stronger than the grammar-based coding theorem proved in [24]. We explain why this is so. Suppose we interpret $H_k(x \mid S)$ as the optimal codeword length of $x$ with respect to a class of alphabet-$A$ "reference sources." Then, we can say that He and Yang have extended the class of reference sources from the class of $k$-state finite-state sources on $A$ (the class of sources considered in [24]) to a class of sources including all $k$-state renewal processes on $A$. The information theory associated with renewal processes has been of much interest in recent years [9], [13]; the correspondence paper of He and Yang provides a useful addition to that literature. As a final remark, we note that the coding theorem proved in [18] very likely extends to other complete prefix sets than the run-length one. By allowing $S$ to range over other complete prefix sets, one obtains classes of reference sources which follow a "variable-length shift" model; variable-length shift source models were considered in the late 1970s and early 1980s in the development of the theory of asymptotically mean stationary processes ([15, Example 6]).

Like the He–Yang correspondence paper [18], the Kieffer–Yang paper [25] also uses context-free grammars as the key data structure. The difference between these two papers is that [25] designs a new class of grammar-based codes, instead of reconsidering a previously considered class. We spell out the background needed to understand [25]. Let $x$ be a data sequence over the finite alphabet $A$, and let $y$ be a sequence over the finite alphabet $B$ which is a coarsening of $x$, meaning that there is a function from $A$ into $B$ that carries $x$ into $y$ when applied coordinate by coordinate (in obtaining coarsenings of different data sequences, this "coarsening function" is allowed to change arbitrarily). It is assumed that an encoder possesses $x$ and knows $y$, and that a decoder possesses and knows only $y$. The lossless refinement source coding problem is the problem of determining a universal procedure via which the encoder sends code bits to the decoder that enable the decoder to "refine" $y$ and obtain $x$. Kieffer and Yang [25] determine such a universal procedure via a grammar-based approach, which we can summarize as follows. The encoder first represents $x$ via a context-free grammar $G_x$ such that the language generated by $G_x$ is $\{x\}$. Secondly, the encoder sends the decoder bits which enable the decoder to build a context-free grammar $G_y$ representing $y$ for which there is a mapping carrying $G_x$ into $G_y$ that preserves the production rule structure. Finally, the encoder sends to the decoder some additional bits (these are the key bits generated by the algorithm) which the decoder uses to "refine" $G_y$ to obtain $G_x$, from which the decoder "grows" $x$. There is introduced a notion of the conditional entropy of the grammar $G_x$ given the grammar $G_y$, which proves to be approximately equal to the number of key bits that are transmitted. This grammar-based algorithm is universal, yielding $O(\log\log n/\log n)$ redundancy per sample with respect to any finite class of finite-state lossless refinement compression schemes when maximized over all pairs $(x, y)$; no lossless refinement compression scheme other than the one in [25] is presently known which has redundancy per sample performance decaying faster than $\log\log n/\log n$. The Kieffer–Yang scheme also has implications for lossy refinement source coding [28], [11]; each refinement coding step that takes place in a lossy refinement source code can be regarded as a lossless refinement coding step with respect to the data reconstructed on this step, which implies that this step can just as well be performed using the Kieffer–Yang scheme.
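As a concrete illustration of the basic data structure, here is a minimal sketch of building a context-free grammar whose language is exactly one data sequence, using the well-known Re-Pair heuristic (replace the most frequent digram with a fresh variable). This is our illustration only; it is neither the irreducible-grammar construction of [24] and [18] nor the refinement scheme of [25].

from collections import Counter

def repair_grammar(x):
    # Build a grammar with language {x}: repeatedly replace the most
    # frequent adjacent pair of symbols by a new variable (Re-Pair).
    # Pair counts may overlap; the left-to-right pass handles that.
    # Returns (start, rules): the start rule's right side, and a dict
    # mapping each variable to its two-symbol right side.
    seq = list(x)
    rules = {}
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break
        var = "R%d" % len(rules)
        rules[var] = pair
        out, i = [], 0
        while i < len(seq):                     # left-to-right replacement
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(var)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules

start, rules = repair_grammar("abracadabra abracadabra")
print(start, rules)  # expanding the rules recursively regenerates the input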
B. Compression of Words Constrained by a Graph

Let $A$ be a finite source alphabet, and for each positive integer $n$, let $\sim_n$ be an equivalence relation on $A^n$. We suppose that
the family of equivalence relations $\{\sim_n\}$ satisfies the following subadditivity property: for each pair of positive integers $m$, $n$, if $u$, $u'$ are equivalent under $\sim_m$ and $v$, $v'$ are equivalent under $\sim_n$, then $uv$, $u'v'$ are equivalent under $\sim_{m+n}$. (Juxtaposition denotes left-to-right concatenation of strings.) Suppose $A$-valued random variables $X_1, X_2, \ldots$ are generated by a stationary ergodic source with alphabet $A$, and for each $n$, let $Y_n$ be the random variable which is the equivalence class of $X_1 X_2 \cdots X_n$ under $\sim_n$. First Körner [27] and then Kieffer [21], [23] laid down successive steps in a source coding theory for the sequence of variables $\{Y_n\}$; in this theory, one reconstructs at the decoder not $X_1 X_2 \cdots X_n$ but its equivalence class $Y_n$ instead, or, equivalently, one reconstructs one fixed sequence belonging to this equivalence class. One consequence of this theory is that the asymptotic equipartition property (AEP) holds for the sequence $\{Y_n\}$, namely, there is a nonnegative constant $H$ (less than or equal to the entropy rate of the given stationary ergodic source) such that almost surely

$$\lim_{n \to \infty} -\frac{1}{n} \log P(Y_n) = H \qquad (3)$$

where $P(Y_n)$ is the usual probability function, which takes the value $\Pr[Y_n = y]$ if $Y_n$ takes equivalence class $y$ as its value. (This AEP is a corollary of Theorem 2 of [22].) Motivated by the Kolmogorov complexity concept, define $K(Y_n)$ to be the length of the shortest binary program that can be run on a universal computer to yield $Y_n$. In the paper [40] in this issue, Savari proves that almost surely

$$\lim_{n \to \infty} \frac{K(Y_n)}{n} = H. \qquad (4)$$

This is a universal coding result, since the left-hand side of (4) does not depend on the source (although the universal code given by this result is nonconstructible). Alternatively, the result can be proved either by relating the left-hand side of (4) to the left-hand side of (3), or by referring to the paper [47]. By specializing $\{\sim_n\}$ to a specific set of equivalence relations motivated by computer science considerations, Savari is sometimes able to find bounds for $H$ or even to compute $H$; this is the major contribution of [40]. In this part of Savari's work, a finite undirected graph $G$ with vertex set $A$ is used, and two blocks $u$, $v$ from $A^n$ are declared to be equivalent under $\sim_n$ if there is a finite sequence of blocks from $A^n$, starting with $u$ and ending with $v$, such that each block after the first one is obtained by transposing two consecutive entries of the preceding block that form an edge in $G$. The structure of the graph $G$ dictates what Savari is able to learn about the value of $H$ (which is called the "interchange entropy" in this context).
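The equivalence relation just described makes $A^n$ modulo $\sim_n$ a partially commutative (trace) monoid. A classical projection criterion for trace monoids (a standard fact, not taken from [40]) says that two words are equivalent iff they have the same count of each letter and agree on every projection onto a pair of letters that does not form an edge of $G$. A minimal sketch of an equivalence test based on this criterion:

from itertools import combinations

def trace_equivalent(u, v, commuting_edges):
    # u, v: strings over a common alphabet.
    # commuting_edges: pairs of letters that may be transposed when adjacent.
    # Projection criterion: u ~ v iff their projections onto every
    # dependent (non-commuting) pair of letters coincide.
    edges = {frozenset(e) for e in commuting_edges}
    alphabet = set(u) | set(v)
    proj = lambda w, letters: [c for c in w if c in letters]
    for a in alphabet:                      # single letters are dependent:
        if proj(u, {a}) != proj(v, {a}):    # counts of each letter must match
            return False
    for a, b in combinations(sorted(alphabet), 2):
        if frozenset((a, b)) not in edges and proj(u, {a, b}) != proj(v, {a, b}):
            return False
    return True

# With 'a' and 'b' commuting: "abc" ~ "bac", but "abc" is not ~ "acb".
print(trace_equivalent("abc", "bac", [("a", "b")]))  # True
print(trace_equivalent("abc", "acb", [("a", "b")]))  # False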
C. Faster Tree Source Coding

A recursively generated conditional probability function gives rise to an arithmetic coder which may yield efficient compression but be slow in a standard implementation. We suppose that the conditional probability functions arise from a tree source model. For encoder/decoder processing of the current source sample, one starts at the root of the tree with the first past source sample and follows a path generated by moving backward in the past until a tree leaf is reached, which is labeled by a state that determines the conditional probability function to be used for the current sample. Once the current sample is processed, the next source sample then plays the role of "current sample," and its past is obtained by appending the previous current sample to the beginning of the previous past, the tree then again being used to find the conditional probability function that is now to be used by following a root-to-leaf path. This procedure is slower than it need be because the paths followed in the tree (the "contexts") in processing one source sample to the next have a common segment that is traversed twice. A more computationally efficient method can be devised which achieves a speedup by taking advantage of this common segment.

Martín et al., in their paper [30] in this issue, obtain such a faster implementation for tree source coding. To accomplish this, they use a combination of techniques involving several key ideas, some of which are

• the use of a finite-state machine (called the FSM closure of the tree model) to model the above-described transition between contexts;
• the compactification of the tree, in which some edges can be labeled with a string consisting of more than one letter from the source alphabet; and
• the use of suffix links, a tool used in suffix tree generation methods (such as in Ukkonen's algorithm; see [42] and [16, Ch. 6]).

Martín et al. [30] are thus able to obtain linear encoding/decoding time in the semipredictive approach to the Context Algorithm [45], an algorithm previously established to be a universal lossless coding scheme.

D. Universal Compression Over an Unknown Alphabet

A pattern is a sequence whose first entry is $1$, whose second distinct entry is $2$ (reading left to right), whose third distinct entry is $3$, etc. For example, the patterns of length three are

$$111, \; 112, \; 121, \; 122, \; 123.$$

Let $x$ be a sequence of length $n$ over any alphabet. Let $m$ be the number of distinct symbols appearing in $x$, i.e., $m$ is the cardinality of the set $\{x_1, x_2, \ldots, x_n\}$. Then, there is a natural pattern of length $n$ with alphabet $\{1, 2, \ldots, m\}$ (called the pattern induced by $x$) which reflects the relative positions of the entries of $x$. For example, the pattern induced by the sequence "banana" is $123232$. In general, it is impossible to losslessly encode the set of all sequences of length $n$ over an arbitrary alphabet (because the alphabet may be uncountable and there are only countably many binary codewords that can be used by the lossless encoder). However, it is possible to losslessly encode the set of patterns induced by these sequences, because this is a countable set. Each memoryless probability distribution on the set of sequences of length $n$ over an arbitrary alphabet can be carried over to a probability distribution on the set of patterns of length $n$ by the function which maps each sequence into its induced pattern. In the paper by Orlitsky et al. in this issue [35], a remarkable sublinear bound is proved on the worst case block redundancy of this family of probability distributions on the set of patterns of length $n$, and a linear time sequential lossless pattern coder achieving a sublinear worst case block redundancy bound is given.
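The induced pattern defined above is simple to compute: rename the symbols $1, 2, 3, \ldots$ in order of first appearance. A minimal sketch:

def pattern(seq):
    # Rename the symbols of seq as 1, 2, 3, ... in order of first
    # appearance; the alphabet of seq can be arbitrary.
    index, out = {}, []
    for s in seq:
        if s not in index:
            index[s] = len(index) + 1
        out.append(index[s])
    return out

print(pattern("banana"))         # [1, 2, 3, 2, 3, 2]
print(pattern([3.7, 3.7, 9.1]))  # [1, 1, 2]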
These strong results suggest the following open questions for future research.

• Can a linear time pattern coder be found yielding smaller worst case block redundancy?
• What are the implications for grammar-based coding? (Grammar-based codes encode the concatenation $w$ of the right sides of the production rules of a grammar representing a data sequence; the pattern of $w$ is the key feature of $w$, since $w$ consists mostly of variables of the grammar rather than terminal symbols and it does not matter what notation is used to denote these variables: they are "dummy variables.")

IV. COMPLEXITY ISSUES

A. NP-Hard Lossless Source Coding Problems

Garey et al. [14] showed that optimal lossy source coding can be an NP-complete problem. In this issue, Mitzenmacher [33] investigates how hard certain lossless source coding problems are. Specifically, he shows that the following problems are NP-hard.

• Given a finite set of data files, find the pair of Huffman dictionaries that minimizes the total number of code bits in the compressed files taken together.
• Find multiple preset dictionaries for LZ '77 coding [48].

B. Computational Complexity in Image Retrieval

Databases of various types (e.g., text-only databases, multimedia databases, bioinformatics databases) have proliferated in recent years. One of the concerns of computer science is the efficient management of tasks associated with databases. Retrieval of data from a database is one such task. For text-only databases, there are some well-established approaches to the data retrieval problem. On the other hand, for multimedia databases, more research into data retrieval methods needs to be done. Vasconcelos, in the paper [44] in this issue, concentrates on the problem of image retrieval. A probabilistic formulation of the image retrieval problem is employed, in which there are finitely many database image classes, each governed by a probabilistic model, as well as a query image class governed by a probabilistic model. In this framework, image retrieval in response to a query or sequence of queries can be regarded in a Bayesian decision theory context. Maximum a posteriori probability retrieval (MAP retrieval) is the retrieval method yielding the minimum probability of image retrieval error. Unfortunately, implementation of the MAP method involves an optimization that might entail a high degree of computational complexity. The paper [44] addresses this problem by deriving computationally efficient ways to evaluate the MAP function. Kullback–Leibler divergence is a principal tool in the analysis. It is interesting that Kullback–Leibler divergence plays a role in this computer science problem as well as in information theory, exhibiting another instance of the interface with which this issue is concerned.
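For readers unfamiliar with the MAP rule in this setting, the following toy sketch selects the database class maximizing prior times likelihood of the query features. It is entirely our illustration: the Gaussian feature models and class names are invented, and [44] is about evaluating this rule efficiently, which the brute-force loop below does not attempt.

import math

def map_retrieve(query_features, classes):
    # classes: dict mapping class name -> (prior, per-feature log-likelihood).
    # Returns the class with the largest posterior log-probability.
    best, best_score = None, -math.inf
    for name, (prior, loglik) in classes.items():
        score = math.log(prior) + sum(loglik(f) for f in query_features)
        if score > best_score:
            best, best_score = name, score
    return best

def gauss_loglik(mu, sigma):
    # Log-density of a univariate Gaussian feature model.
    return lambda f: -0.5 * (((f - mu) / sigma) ** 2
                             + math.log(2 * math.pi * sigma ** 2))

classes = {"beach": (0.5, gauss_loglik(0.8, 0.2)),
           "forest": (0.5, gauss_loglik(0.2, 0.2))}
print(map_retrieve([0.75, 0.9], classes))  # -> "beach"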
C. Computational Complexity of MAP Decoding

As we have just discussed, the paper of Vasconcelos is concerned with the MAP method in image retrieval. The MAP method is also a well-known method in communication engineering, where it is called MAP decoding. MAP decoding is implemented at the output of a channel to estimate a transmitted data sequence that is corrupted by channel noise; the MAP decoder output is chosen as that data sequence which is most likely to have been transmitted, given the observed channel output sequence. The correspondence paper [46] of Wu et al. in this issue is concerned with reducing the computational complexity associated with a MAP decoder. Specifically, the authors address the problem of MAP decoding of Markov sequences transmitted over a memoryless binary channel that produces independent substitution and erasure errors. In this context, MAP decoding becomes the problem of finding a longest path in a weighted acyclic directed graph. A matrix formulation of this computational problem is presented, which, for a Gauss–Markov input sequence, reduces the computational complexity associated with the MAP decoder. The authors point out that their algorithmic approach, whose origins can be traced back through the computer science literature, could potentially be useful in other MAP decoding problems in communication engineering.

D. Program Plagiarism Detection Systems

The originators of software programs need good ways of detecting whether someone has modified one of their programs. A system for accomplishing this task is called a program plagiarism detection system. Chen et al., in their correspondence paper [5] in this issue, put forth a method for designing such a system. The idea is to use a nonnegative distance function $d(p, q)$ defined for pairs of programs, such that the closer $d(p, q)$ is to zero for two programs $p$, $q$, the more likely it is that one of the two programs is a plagiarized version of the other. In an earlier work [6], such a distance function $d^*$ was put forth, based on Kolmogorov complexity. The distance function $d^*$ was shown to satisfy a universality property in that it minorized computable distance functions of a certain type (see [5] for the precise sense of this minorization). Roughly speaking, the universality property is such that if a computable distance function $d$ is small for a pair of programs $p$, $q$, then $d^*(p, q)$ will also be small; that is, if any distance measure detects plagiarism, then so does $d^*$. This universality result indicated that the distance function $d^*$ should be the best distance function to use in detecting program plagiarism. Unfortunately, the distance function $d^*$ is noncomputable (because Kolmogorov complexity is noncomputable). In [5], a way around this problem is investigated in which $d^*$ is approximated by a computable distance function. (A compression algorithm is used to heuristically approximate Kolmogorov complexity in the definition of $d^*$.) Using this computable distance function, Chen et al. design a practical plagiarism detection system; results of experiments are reported in [5] that show advantages of this system over other practical plagiarism detection systems.

V. ESTIMATION/PREDICTION IN PROBLEMS ON SEQUENCES

A. Entropy Estimator Based on Block Sorting

The correspondence paper by Cai et al. [4] in this issue presents a sequence of entropy estimators which operate on longer and longer initial segments of an infinite individual sequence over a finite alphabet $A$. A universal estimation result is proved in the following sense: almost surely, when applied
to the sequence of random outputs generated by any stationary ergodic source with alphabet $A$, the proposed sequence of entropy estimators yields entropy estimates converging to the entropy rate of the source. Specializing to a stationary ergodic finite-state Markov source, a result on the speed of convergence of the entropy estimates is obtained. The Burrows–Wheeler block sorting transform [3] is a key feature of these entropy estimators: the entropy estimate based on an initial source segment is formed in a natural way from the sequence which is the Burrows–Wheeler transform of that segment. A question not addressed in [4] is how the performance of the universal sequence of entropy estimators of this correspondence paper compares to the performance of other universal sequences of entropy estimators. This is an open question for the future.
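The following toy sketch (ours; it is not the estimator of [4]) illustrates the idea behind block-sorting estimators: the Burrows–Wheeler transform groups together symbols that occur in similar contexts, so simple statistics of the transformed string reflect the conditional structure of the original.

from collections import Counter
from math import log2

def bwt(s):
    # Burrows-Wheeler transform via sorted rotations
    # (naive O(n^2 log n); fine for illustration).
    t = s + "\0"                                  # unique end marker
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(r[-1] for r in rotations)

def first_order_entropy(s):
    # Empirical entropy of each symbol conditioned on its predecessor,
    # in bits per symbol.
    pair_counts = Counter(zip(s, s[1:]))
    ctx_counts = Counter(s[:-1])
    n = max(len(s) - 1, 1)
    return -sum(c * log2(c / ctx_counts[a])
                for (a, _), c in pair_counts.items()) / n

x = "abracadabra" * 20
print(first_order_entropy(x), first_order_entropy(bwt(x)))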
B. Pseudorandomness Via Prediction Theory

The question concerning whether one can construct a truly random individual sequence has attracted the interest of many people (see [26], [29, Sec. 1.5.1], and the references in [34]). The answer to the question (due to Kolmogorov complexity considerations) is that the very fact that a pseudorandom sequence is constructible dictates that the sequence can obey only a limited number of randomness properties. Within this limitation, many constructions of pseudorandom sequences have been proposed. Prediction theory has been one tool used to obtain pseudorandom individual sequences. For example, Ehrenfeucht and Mycielski [10] construct an infinite pseudorandom binary sequence by taking each entry to be the complement of what would be predicted by a certain universal predictor applied to the previous entries; the pseudorandomness properties of their sequence have not, as of this date, been completely worked out. Nobel, in his paper [34] in this issue, also investigates the problem of pseudorandom individual sequence construction from the prediction theory point of view. He formally defines a memoryless sequence to be an individual sequence of real numbers on which no continuous Markov prediction scheme can outperform the best constant predictor under squared-error loss. He then shows that memoryless sequences obey asymptotic behavior that we would expect of "typical sequences" generated by a memoryless information source (a law of large numbers and a central limit theorem are established). He also discusses the construction of memoryless sequences.
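The Ehrenfeucht–Mycielski construction mentioned above is easy to state operationally: each new bit is the complement of the bit that followed the most recent earlier occurrence of the longest suffix of the current prefix that has occurred before (with the convention, assumed here, that the empty suffix always qualifies, so the bit after the start complements the last bit). A naive sketch:

def em_sequence(n):
    # First n bits of the Ehrenfeucht-Mycielski sequence [10].
    # Naive search, roughly O(n^3); for illustration only.
    bits = [0]
    while len(bits) < n:
        m = len(bits)
        follower = bits[-1]                      # empty-suffix convention
        for k in range(m - 1, 0, -1):            # longest suffix first
            suffix = bits[m - k:]
            for j in range(m - 1, k - 1, -1):    # most recent earlier occurrence
                if bits[j - k:j] == suffix:
                    follower = bits[j]
                    break
            else:
                continue
            break
        bits.append(1 - follower)
    return bits

print(em_sequence(9))  # [0, 1, 0, 0, 1, 1, 0, 1, 0]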
C. Resource-Bounded Universal Prediction

Universal prediction is a research area that began in the 1950s; the paper [31] provides a good account of its history. Information theorists, statisticians, and computer scientists have been inspired to work in this area [2], [17], [8], [36], [1], [19]. The paper by Meron and Feder [32] in this issue concerns the universal prediction theory of individual sequences, a theory which received great impetus from the award-winning paper [12]. We give some background concerning universal prediction so that the contributions of the paper [32] can be placed in context. A binary predictor may be thought of as a mapping $f$ which assigns to each finite (possibly empty) binary string a probability distribution on $\{0, 1\}$. Given any binary individual sequence $x = x_1 x_2 \cdots x_n$ of length $n$, the binary predictor successively forms a prediction $\hat{X}_i$ of each entry $x_i$, where $\hat{X}_i$ is a binary-valued random variable having distribution $f(x_1 \cdots x_{i-1})$ ($x_1 \cdots x_{i-1}$ is the string formed by the past entries, the empty string when $i = 1$). The Hamming prediction loss in prediction of the entries of $x$ via predictor $f$ is defined by

$$L_f(x) = \frac{1}{n} \sum_{i=1}^{n} \Pr[\hat{X}_i \ne x_i].$$

If $k$ is a nonnegative integer, then the predictor $f$ is said to be a $k$th-order predictor if $f(u) = f(v)$ whenever the binary strings $u$, $v$ have the same suffix of length $k$. (In particular, a $0$th-order predictor is a predictor for which all the probability distributions are the same.) The $k$th-order binary predictors for each $k$ all belong to the more general class of predictors called finite-state binary predictors. We define $\pi_k(x)$ to be the infimum of $L_f(x)$ over all $k$th-order predictors $f$; thus, $\pi_k(x)$ gives the optimum Hamming prediction loss in applying $k$th-order predictors to the individual sequence $x$. A predictor $f$ is said to be a universal predictor for binary individual sequences with respect to the family of $k$th-order predictors if

$$\max_{x \in \{0,1\}^n} \left[ L_f(x) - \pi_k(x) \right] \to 0 \quad \text{as } n \to \infty. \qquad (5)$$

The left-hand side of (5) is called the $k$th-order regret for predictor $f$ (on individual sequences of length $n$). Previous research in universal prediction theory (surveyed in [31]) had shown the existence of universal predictors, and had examined the speed of decay of the $k$th-order regret for a universal predictor. However, a universal predictor is not a finite-state predictor. It could be useful to see how small the $k$th-order regret can become as $n \to \infty$ for a predictor which is resource bounded in the sense that it is a finite-state predictor governed by a fixed number of states $S$; this is the problem investigated in [32]. We list some of the results obtained in [32].

i) An $S$-state finite-state predictor is constructed with respect to the class of $k$th-order predictors, and its $k$th-order regret on binary individual sequences of length $n$ is shown to be upper-bounded, for $n$ sufficiently large, by an explicit positive constant depending only on $k$ and $S$.

ii) An explicit positive constant depending only on $k$ and $S$ is put forth for which it is shown that any $S$-state predictor on binary individual sequences of length $n$ has $k$th-order regret lower-bounded by this constant for $n$ sufficiently large.

iii) The $k$th-order regret of the $S$-state predictor constructed with respect to the $k$th-order predictors converges to zero as $S \to \infty$, and its rate of convergence is the optimum asymptotic $k$th-order regret performance among all $S$-state finite-state predictors.

Similar results are obtained for prediction loss functions other than the Hamming loss function. It is an interesting open problem to see whether the results of [32] can be extended from a binary alphabet to any nonbinary finite alphabet (this problem seems hard, since many of the techniques of [32] seem not to extend readily to nonbinary alphabets).
VI. CONCLUSION

An incomplete list of topics follows, illustrating some of the ways in which information theory (IT) and computer science (CS) impact each other.

• Analysis of algorithms applied to analyze algorithms for data compression, prediction, pseudorandom number generation, and classification.
• Computational or descriptive complexity applied to examine the complexity/performance tradeoff in lossless/lossy data compression.
• CS data structures (grammars, trees, and graphs) applied to design codes for sequences.
• Exact and approximate pattern-matching techniques applied to context modeling/coding.
• Entropy, mutual information, and information distance concepts applied to problems of computational biology and bioinformatics.
• IT concepts applied to provide bounds in computational learning theory.
• Applications of IT to computational linguistics.
• Applications of IT to information retrieval (especially data mining).

Many of these topics are well represented in the papers in this special issue. The reader will find in other journals further examples of papers categorizable as being in the IT/CS interface. The range of topics at the interface is likely to continue to grow with time, driven by society's need to handle data more efficiently and more rapidly. Of course, one area at the interface that is going to continue to undergo an explosion in growth for many years to come is bioinformatics. In summary, researchers who choose the IT/CS interface as their research area will find a wealth of interesting problems.

JOHN C. KIEFFER
University of Minnesota
Department of Electrical and Computer Engineering
Minneapolis, MN 55455 USA

WOJCIECH SZPANKOWSKI
Purdue University
Department of Computer Science
W. Lafayette, IN 47907 USA

EN-HUI YANG
University of Waterloo
Department of Electrical and Computer Engineering
Waterloo, ON N2L 3G1 Canada

REFERENCES

[1] P. Algoet, "Universal schemes for prediction, gambling and portfolio selection," Ann. Probab., vol. 20, pp. 901–941, 1992.
[2] D. Blackwell, "Controlled random walk," in Proc. 1954 Congress of Mathematicians. Amsterdam, The Netherlands: North-Holland, vol. III, pp. 336–338.
[3] M. Burrows and D. Wheeler, "A block-sorting lossless data compression algorithm," Digital Systems Res. Ctr., Tech. Rep. 124, 1994.
[4] H. Cai, S. Kulkarni, and S. Verdú, "Universal entropy estimation via block sorting," IEEE Trans. Inform. Theory, vol. 50, pp. 1551–1561, July 2004.
[5] X. Chen, B. Francia, M. Li, B. McKinnon, and A. Seker, "Shared information and program plagiarism detection," IEEE Trans. Inform. Theory, vol. 50, pp. 1545–1551, July 2004.
[6] X. Chen, M. Li, X. Li, B. Ma, and P. Vitanyi, "The similarity metric," IEEE Trans. Inform. Theory, to be published.
[7] B. Conrad and M. Mitzenmacher, "Power laws for monkeys typing randomly: The case of unequal probabilities," IEEE Trans. Inform. Theory, vol. 50, pp. 1403–1414, July 2004.
[8] T. Cover, "Behavior of sequential predictors of binary sequences," in Trans. 4th Prague Conf. Information Theory, Statistical Decision Functions and Random Processes, Prague, Czechoslovakia, 1965, pp. 263–272.
[9] I. Csiszár and P. Shields, "Redundancy rates for renewal and other processes," IEEE Trans. Inform. Theory, vol. 42, pp. 2065–2072, Nov. 1996.
[10] A. Ehrenfeucht and J. Mycielski, "A pseudorandom sequence—How random is it?," Amer. Math. Monthly, vol. 99, pp. 373–375, 1992.
[11] W. Equitz and T. Cover, "Successive refinement of information," IEEE Trans. Inform. Theory, vol. 37, pp. 269–275, Mar. 1991.
[12] M. Feder, N. Merhav, and M. Gutman, "Universal prediction of individual sequences," IEEE Trans. Inform. Theory, vol. 38, pp. 1258–1270, July 1992.
[13] P. Flajolet and W. Szpankowski, "Analytic variations on redundancy rates of renewal processes," IEEE Trans. Inform. Theory, vol. 48, pp. 2911–2921, Nov. 2002.
[14] M. Garey, D. Johnson, and H. Witsenhausen, "The complexity of the generalized Lloyd–Max problem," IEEE Trans. Inform. Theory, vol. IT-28, pp. 255–256, Mar. 1982.
[15] R. Gray and J. Kieffer, "Asymptotically mean stationary measures," Ann. Probab., vol. 8, pp. 962–973, 1980.
[16] D. Gusfield, Algorithms on Strings, Trees, and Sequences. Cambridge, U.K.: Cambridge Univ. Press, 1997.
[17] F. Hannan, "Approximation to Bayes risk in repeated plays," in Contributions to the Theory of Games. Princeton, NJ: Princeton Univ. Press, 1957, vol. III, Annals of Mathematics Studies no. 39, pp. 97–139.
[18] D.-K. He and E.-H. Yang, "Performance analysis of grammar-based codes revisited," IEEE Trans. Inform. Theory, vol. 50, pp. 1524–1535, July 2004.
[19] P. Jacquet, W. Szpankowski, and I. Apostol, "A universal predictor based on pattern matching," IEEE Trans. Inform. Theory, vol. 48, pp. 1462–1472, June 2002.
[20] P. Jacquet and W. Szpankowski, "Markov types and minimax redundancy for Markov sources," IEEE Trans. Inform. Theory, vol. 50, pp. 1393–1402, July 2004.
[21] J. Kieffer, "Block coding for an ergodic source relative to a zero-one valued fidelity criterion," IEEE Trans. Inform. Theory, vol. IT-24, pp. 432–438, July 1978.
[22] ——, "An ergodic theorem for constrained sequences of functions," Bull. Amer. Math. Soc., vol. 21, pp. 249–254, 1989.
[23] ——, "Strong converses in source coding relative to a fidelity criterion," IEEE Trans. Inform. Theory, vol. 37, pp. 257–262, Mar. 1991.
[24] J. Kieffer and E.-H. Yang, "Grammar-based codes: A new class of universal lossless source codes," IEEE Trans. Inform. Theory, vol. 46, pp. 737–754, May 2000.
[25] ——, "Grammar-based lossless universal refinement source coding," IEEE Trans. Inform. Theory, vol. 50, pp. 1415–1424, July 2004.
[26] A. Kolmogorov and V. Uspenskii, "Algorithms and randomness," Theory of Probability and its Applications, vol. 32, pp. 389–412, 1987.
[27] J. Körner, "Coding of an information source having ambiguous alphabet and the entropy of graphs," in Trans. 6th Prague Conf. Information Theory, Statistical Decision Functions and Random Processes, Prague, Czechoslovakia, 1973, pp. 411–425.
[28] V. Koshelev, "Hierarchical coding of discrete sources," Probl. Pered. Inform., vol. 16, pp. 31–49, 1980.
[29] M. Li and P. Vitanyi, An Introduction to Kolmogorov Complexity and Its Applications. New York: Springer-Verlag, 1993.
[30] A. Martín, G. Seroussi, and M. Weinberger, "Linear time universal coding and time reversal of tree sources via FSM closure," IEEE Trans. Inform. Theory, vol. 50, pp. 1442–1468, July 2004.
[31] N. Merhav and M. Feder, "Universal prediction," IEEE Trans. Inform. Theory, vol. 44, pp. 2124–2147, Oct. 1998.
[32] E. Meron and M. Feder, "Finite memory universal prediction of individual sequences," IEEE Trans. Inform. Theory, vol. 50, pp. 1506–1523, July 2004.
[33] M. Mitzenmacher, "On the hardness of finding optimal multiple preset dictionaries," IEEE Trans. Inform. Theory, vol. 50, pp. 1536–1539, July 2004.
[34] A. Nobel, "Some stochastic properties of memoryless individual sequences," IEEE Trans. Inform. Theory, vol. 50, pp. 1497–1505, July 2004.
[35] A. Orlitsky, P. Santhanam, and J. Zhang, "Universal compression of memoryless sources over unknown alphabets," IEEE Trans. Inform. Theory, vol. 50, pp. 1469–1481, July 2004.
[36] J. Rissanen, "Universal coding, information, prediction, and estimation," IEEE Trans. Inform. Theory, vol. IT-30, pp. 629–636, July 1984.
[37] ——, "Fisher information and stochastic complexity," IEEE Trans. Inform. Theory, vol. 42, pp. 40–47, Jan. 1996.
[38] B. Ryabko, "Prediction of random sequences and universal coding," Probl. Pered. Inform., vol. 24, pp. 3–14, 1988.
[39] ——, "The complexity and effectiveness of prediction algorithms," J. Complexity, vol. 10, pp. 281–295, 1994.
[40] S. Savari, "Compression of words over a partially commutative alphabet," IEEE Trans. Inform. Theory, vol. 50, pp. 1425–1441, July 2004.
[41] Y. Shtarkov, "Coding of discrete sources with unknown statistics," Colloquia Mathematica Societatis János Bolyai, vol. 16, pp. 559–574, 1977.
[42] E. Ukkonen, "On-line construction of suffix trees," Algorithmica, vol. 14, pp. 249–260, 1995.
[43] B. Tunstall, "Synthesis of noiseless compression codes," Ph.D. dissertation, Georgia Inst. Technol., Atlanta, 1967.
[44] N. Vasconcelos, "On the efficient evaluation of probabilistic similarity functions for image retrieval," IEEE Trans. Inform. Theory, vol. 50, pp. 1482–1496, July 2004.
[45] M. Weinberger, A. Lempel, and J. Ziv, "A sequential algorithm for the universal coding of finite memory sources," IEEE Trans. Inform. Theory, vol. 38, pp. 1002–1014, May 1992.
[46] X. Wu, S. Dumitrescu, and Z. Wang, "Monotonicity-based fast algorithms for MAP estimation of Markov sequences over noisy channels," IEEE Trans. Inform. Theory, vol. 50, pp. 1539–1544, July 2004.
[47] E.-H. Yang and S.-Y. Shen, "Distortion program-size complexity with respect to a fidelity criterion and rate-distortion function," IEEE Trans. Inform. Theory, vol. 39, pp. 288–292, Jan. 1993.
[48] J. Ziv and A. Lempel, "A universal algorithm for sequential data compression," IEEE Trans. Inform. Theory, vol. IT-23, pp. 337–343, May 1977.
John C. Kieffer (M'86–SM'87–F'93) was born and raised in St. Louis, MO. In 1970, he received the Ph.D. degree in mathematics from the University of Illinois at Urbana-Champaign.

During 1970–1976, he was Assistant Professor in the Department of Mathematics and Statistics, University of Missouri-Rolla. At this same institution, he was Associate Professor during 1976–1980 and Professor during 1980–1986. Since 1986, he has been at the University of Minnesota, Minneapolis, where he is a faculty member in the Department of Electrical and Computer Engineering and the Control Science and Dynamical Systems Program. He has held visiting positions in the Department of Electrical Engineering at Stanford University, Stanford, CA (1978–1979), the Department of Electrical Engineering, University of Illinois at Urbana-Champaign (1980, 1984–1985), the Department of Electrical and Computer Engineering, University of Arizona, Tucson (1996–1997), the Department of Electrical Engineering, Swiss Federal Institute of Technology (ETH), Zürich, Switzerland (1997), and the Department of Computer Science, University of Arizona (2001). He has over 70 MathSciNet publications in information theory, ergodic theory, probability theory, and other fields.

Dr. Kieffer served on the Program Committee of the 2003 IEEE International Symposium on Information Theory, and is a member of the Mathematical Association of America.

W. Szpankowski (M'87–SM'95–F'03) received the M.S. and Ph.D. degrees in electrical and computer engineering from the Technical University of Gdańsk, Gdańsk, Poland, in 1976 and 1980, respectively.

He was an Assistant Professor at the Technical University of Gdańsk, and in 1984, he held a Visiting Assistant Professor position at McGill University, Montreal, PQ, Canada. Currently, he is Professor of Computer Science at Purdue University, West Lafayette, IN. During 1992–1993, he was Professeur Invité at the Institut National de Recherche en Informatique et en Automatique, Rocquencourt, France; in fall 1999, he was Visiting Professor at Stanford University, Stanford, CA; and in June 2001, he was Professeur Invité at the Université de Versailles, Versailles, France. His research interests cover analysis of algorithms, information theory, bioinformatics, analytic combinatorics and random structures, pattern matching, discrete mathematics, performance evaluation, stability problems in distributed systems, and applied probability. He has published the book Average Case Analysis of Algorithms on Sequences (New York: Wiley, 2001) and has written about 150 papers on these topics.

Dr. Szpankowski has served as a Guest Editor for several journals: in 2002, he edited with M. Drmota a special issue of Combinatorics, Probability & Computing on analysis of algorithms. He is on the editorial boards of Theoretical Computer Science and Foundations and Trends in Communications and Information Theory, and serves as the Managing Editor of Discrete Mathematics and Theoretical Computer Science for Analysis of Algorithms. He has chaired several workshops: in 1999, the Information Theory and Networking Workshop, Metsovo, Greece; in 2000, the Sixth Seminar on Analysis of Algorithms, Krynica Morska, Poland; and in 2003, the NSF Workshop on Information Theory and Computer Science Interface, Chicago, IL. In June 2004, he will chair the 10th Seminar on Analysis of Algorithms, Berkeley, CA. He is a recipient of the Humboldt Fellowship and of AFOSR, NSF, NIH, and NATO research grants.
En-hui Yang (M'97–SM'00) was born in Jiangxi, China, on December 26, 1966. He received the B.S. degree in applied mathematics from HuaQiao University, Qianzhou, China, and the Ph.D. degree in mathematics from Nankai University, Tianjin, China, in 1986 and 1991, respectively.

He joined the faculty of Nankai University in June 1991 and was promoted to Associate Professor in 1992. From January 1993 to July 1993, and from January 1995 to August 1995, he was a Research Associate in the Department of Electrical and Computer Engineering at the University of Minnesota, Minneapolis-St. Paul. During summer (July 1 to August 31) 1994, he was a guest of the Sonderforschungsbereich "Diskrete Strukturen in der Mathematik," University of Bielefeld, Bielefeld, Germany. From October 1993 to May 1997, he was a Visiting Scholar in the Department of Electrical Engineering-Systems at the University of Southern California, Los Angeles. Since June 1997, he has been with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON, Canada, where he is now a Professor and Canada Research Chair in information theory and multimedia compression. He is now on sabbatical in the Department of Information Engineering, the Chinese University of Hong Kong, Hong Kong. His current research interests are: multimedia compression, multimedia watermarking, multimedia transmission, digital communications, information theory, Kolmogorov complexity theory, source and channel coding, quantum information theory, and applied probability theory and statistics.

Dr. Yang is a recipient of several research awards, including the 1992 Tianjin Science and Technology Promotion Award for Young Investigators, the 1992 third Science and Technology Promotion Award of the Chinese National Education Committee, the 2000 Ontario Premier's Research Excellence Award, Canada, the 2000 Marsland Award for Research Excellence, University of Waterloo, and the 2002 Ontario Distinguished Researcher Award.