Statistical Inference Under Multiterminal Data Compression


IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 44, NO. 6, OCTOBER 1998

Te Sun Han, Fellow, IEEE, and Shun-ichi Amari, Fellow, IEEE (Invited Paper)

Manuscript received May 1, 1998. T. S. Han is with the Graduate School of Information Systems, University of Electro-Communications, Chofugaoka 1-5-1, Tokyo 182-8585, Japan. S. Amari is with the Brain-Style Information Systems Group, RIKEN Brain Science Institute, Hirosawa 2-1, Wako-shi, Saitama 351-0198, Japan.

Abstract—This paper presents a survey of the literature on the information-theoretic problems of statistical inference under multiterminal data compression with rate constraints. Significant emphasis is placed on the following problems: 1) multiterminal hypothesis testing, 2) multiterminal parameter estimation, and 3) multiterminal pattern classification, in both the case of positive rates and that of zero rates. In addition, the paper includes three new results, i.e., the converse theorems for the problems of multiterminal hypothesis testing, multiterminal parameter estimation, and multiterminal pattern classification at zero rate.

Index Terms—Covariance, Fisher information, hypothesis testing, multiterminal data compression, parameter estimation, pattern classification, probability of error, rate constraint, statistical inference, universal coding.

I. INTRODUCTION

In 1979, Berger [1] proposed and formulated an intriguing novel problem: to put standard statistical inference problems, such as hypothesis testing and parameter estimation, in the information-theoretic framework of multiterminal data compression schemes. This was the first attempt to combine two seemingly different kinds of problems which had been separately investigated in the void between statistics and information theory. Needless to say, statistics and information theory had been separate from one another; their fundamental approaches and methodologies were substantially different in nature, although both revealed intrinsic features of the same "information" in its wider sense. The true significance of the multiterminal statistical inference system of Berger, as well as the single-user universal coding system of Rissanen [2], is that those systems enabled us to incorporate problems of statistical inference into information theory.

The basic system that Berger proposed is the following. Let $\mathcal{X}$, $\mathcal{Y}$ be any finite sets. Given a family

  $\{P_\theta(x,y) : \theta \in \Theta\}$,  $x \in \mathcal{X}$, $y \in \mathcal{Y}$

of joint probability distributions on $\mathcal{X} \times \mathcal{Y}$, indexed by a parameter $\theta \in \Theta$ ($\Theta$ is a prescribed appropriate set, finite or infinite), we may consider the following multiuser situation.

Let data

  $X^n = (X_1, X_2, \dots, X_n)$   (1.1)
  $Y^n = (Y_1, Y_2, \dots, Y_n)$   (1.2)

of blocklength $n$ be generated at two separate remote sites A and B, respectively, where the pairs $(X_i, Y_i)$ are independently and identically distributed (i.i.d.) subject to a joint distribution $P_\theta(x,y)$ with an unknown parameter $\theta \in \Theta$. Here, $X_i$ and $Y_i$ are correlated to each other.

In the usual situation of statistics, one makes statistical inference about the actual parameter $\theta$ under the assumption that these data $(X^n, Y^n)$ are fully available. In many practical situations, however, this is not necessarily the case; rather, it would be usual that we impose some limitations on the capacity for transmitting the generated data $(X^n, Y^n)$ from sites A and B to the statistician at another site C, or other kinds of limitations such as the precision of the numerical expression of those data, and so on. It then becomes unavoidable to incorporate the operation of compressing the generated data $(X^n, Y^n)$ into the forms $f_n(X^n)$ and $g_n(Y^n)$ ($f_n$ and $g_n$ are called the encoders) at sites A and B, respectively, which are in turn transmitted to a common information-processing center at site C. It should be emphasized here that the encoders $f_n$ and $g_n$ can observe only $X^n$ and $Y^n$, respectively. The center is then required to make an optimal statistical estimation of the true value of $\theta$ (Fig. 1) in the form $\psi_n(f_n(X^n), g_n(Y^n))$ ($\psi_n$ is called the decoder).

In many cases, the capacity restriction is expressed in the form that the sizes $\|f_n\|$, $\|g_n\|$ of the ranges of the encoders are upper-bounded asymptotically as

  $\limsup_{n\to\infty} (1/n) \log \|f_n\| \le R_1$,  $\limsup_{n\to\infty} (1/n) \log \|g_n\| \le R_2$   (1.3)

where the nonnegative constants $R_1, R_2$ are called the rates for the encoders $f_n$ and $g_n$, respectively. Here, given rates $R_1, R_2$, we are allowed to choose any encoder functions $f_n, g_n$ as long as rate constraint (1.3) is satisfied. The basic question posed in the framework of such a multiuser coding system is: How should the encoders $f_n, g_n$ be constructed under rate constraint (1.3) in order for the decoder $\psi_n$ at the center to yield an optimal effective estimator

  $\hat\theta_n = \psi_n(f_n(X^n), g_n(Y^n))$   (1.4)

for the parameter $\theta$? In the present paper we shall concentrate on this basic question as well as on several related problems of statistical testing, estimation, and classification.
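To make the setup concrete, the following minimal sketch simulates the system for a hypothetical doubly symmetric binary family $P_\theta(x,y)$ in which $X_i$ is a fair coin flip and $Y_i$ differs from $X_i$ with probability $\theta$. The "send a raw prefix" encoders used here are only one admissible choice of $f_n, g_n$ under rate constraint (1.3), not the scheme of any theorem in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate(theta, n):
    """Doubly symmetric binary source: X ~ Bernoulli(1/2), Y = X xor noise(theta)."""
    x = rng.integers(0, 2, n)
    y = x ^ (rng.random(n) < theta).astype(int)
    return x, y

def prefix_encoder(data, n, R):
    """Send the first floor(n*R) raw bits: range size 2^(nR), so (1/n)log2||f_n|| <= R."""
    k = int(np.floor(n * R))
    return data[:k]

def decoder(fx, gy):
    """Estimate theta as the empirical disagreement rate on the shared prefix."""
    k = min(len(fx), len(gy))
    return np.mean(fx[:k] != gy[:k])

n, R1, R2, theta = 10000, 0.2, 0.2, 0.1
x, y = generate(theta, n)
hat = decoder(prefix_encoder(x, n, R1), prefix_encoder(y, n, R2))
print(f"true theta = {theta}, estimate = {hat:.4f}")  # variance ~ 1/(n*min(R1, R2))
```

Sending raw prefixes wastes the correlation structure; the schemes surveyed below (types, auxiliary random variables, binning) do far better at the same rates.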



Fig. 1. Multiterminal system for statistical inference.

It should be noted here that in this system, in contrast to the case of usual data compression systems, the whole data $(X^n, Y^n)$ is not necessarily required to be reliably reproduced at the center. Instead, we are concerned only with reliably reproducing the value of an unknown parameter $\theta$ governing the generation of the data $(X^n, Y^n)$. If the rates $R_1, R_2$ are large enough, the center can have access to the full data $(X^n, Y^n)$ and can then make an optimal estimator based on this full data. The problem then reduces to the well-established standard statistical inference problem. However, if the rates $R_1, R_2$ are not large enough to reliably reproduce the full data at the center, we have to construct an estimator using only partial information about the data $(X^n, Y^n)$. This is the main reason why the problem of statistical inference under multiterminal data compression is so drastically different from that of traditional multiterminal data compression. Information theory tells us that, if the rates $R_1, R_2$ are not large enough (even though they are both positive), the probability of decoding error necessarily approaches one as the blocklength $n$ tends to infinity. This makes no sense from the viewpoint of traditional lossless or lossy data compression. Even in this situation, however, we can make an effective estimator $\hat\theta_n$ with high reliability.

To make the point clearer, let us consider as a special case of (1.3) the following rate constraint:

  $R_1 = 0$ and/or $R_2 = 0$.   (1.5)

These rate constraints are called the zero-rate compression. This means that the zero-rate encoders alone carry asymptotically negligible information, which is of no use in traditional data compression but is still, in general, of significant use in constructing an effective estimator $\hat\theta_n$. Thus zero-rate data compression becomes one of the main subjects of statistical inference in the multiterminal framework. Therefore, in the subsequent sections we shall also address the details of the zero-rate statistical inference problem as a subject of independent interest in its own right.

The problem of statistical inference under multiterminal data compression ramifies into three main branches depending on the choice of the parameter set $\Theta$ and the manner of statistical decision.

First, we consider the case in which $\Theta$ consists of only two elements, say $\Theta = \{\theta_1, \theta_2\}$, where $\theta_1$ and $\theta_2$ are called the null hypothesis and the alternative hypothesis, respectively. Then, we can write the testing problem as

  $H_0: P_{\theta_1}$ versus $H_1: P_{\theta_2}$.   (1.6)

In this case, according to the tradition in statistics, we define two kinds of error probabilities as follows:

  $\alpha_n = \Pr\{(X^n, Y^n) \notin \mathcal A_n \mid \theta_1\}$   (1.7)
  $\beta_n = \Pr\{(X^n, Y^n) \in \mathcal A_n \mid \theta_2\}$   (1.8)

where $(X^n, Y^n)$ are independently and identically distributed random variables of length $n$ subject to the joint probability distributions $P_{\theta_1}$, $P_{\theta_2}$ on $\mathcal X \times \mathcal Y$, respectively; $\alpha_n$ and $\beta_n$ are called the error probability of the first kind and the error probability of the second kind, respectively. Incidentally, the region defined by

  $\mathcal A_n = \{(x^n, y^n) : \psi_n(f_n(x^n), g_n(y^n)) = \theta_1\}$

is called the acceptance region. If $(x^n, y^n) \in \mathcal A_n$, then the decoder decides that $H_0$ is correct; otherwise, that $H_1$ is correct. In this case, our purpose is, given rate constraint (1.3), to design a statistical inference system so as to make $\beta_n$ as small as possible when a prescribed upper bound on $\alpha_n$ is imposed. We notice here that the error probability $\beta_n$ usually decays exponentially fast with the blocklength $n$. This leads us to formulate and study the problem of multiterminal hypothesis testing, sometimes also called the problem of distributed detection.

The above setting is the case in which the alternative hypothesis is simple. A more general case is hypothesis testing with a composite alternative hypothesis $\Theta_2$, where $\Theta_2$ includes a (possibly infinite) number of distributions. Since our theory is asymptotic, composite alternative cases reduce to simple hypothesis testing with $H_0: \theta_1$ and $H_1: \bar\theta$, where

  $\bar\theta = \arg\min_{\theta \in \Theta_2} D(P_{\theta_1} \| P_\theta)$   (1.9)

and $D(\cdot \| \cdot)$ is the Kullback–Leibler divergence defined by

  $D(P \| Q) = \sum_{x \in \mathcal X} \sum_{y \in \mathcal Y} P(x, y) \log \frac{P(x, y)}{Q(x, y)}$.   (1.10)

However, in the typical composite case with a continuous parameter set and $\Theta_2 = \Theta \setminus \{\theta_1\}$, such a reduction to the simple alternative case does not make sense, because the minimum in (1.9) is then zero. In order to treat this typical case in a reasonable way, techniques developed in the estimation problem are useful.
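As a sketch of the reduction (1.9), the following code computes the divergence (1.10) and selects the alternative closest to the null from a small hypothetical family; the three candidate distributions are made up for illustration.

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence D(P||Q) of (1.10), in nats, for arrays on X x Y."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p0 = np.array([[0.4, 0.1],
               [0.1, 0.4]])          # null hypothesis P_{theta_1} (assumed)

alternatives = {                      # a composite alternative family (assumed)
    "a": np.array([[0.25, 0.25], [0.25, 0.25]]),
    "b": np.array([[0.30, 0.20], [0.20, 0.30]]),
    "c": np.array([[0.10, 0.40], [0.40, 0.10]]),
}

# (1.9): the composite case reduces to the alternative minimizing D(P_{theta_1}||P_theta)
closest = min(alternatives, key=lambda k: kl(p0, alternatives[k]))
for k, q in alternatives.items():
    print(k, kl(p0, q))
print("closest alternative:", closest)
```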


Another problem is pattern classification or discrimination, where the parameter set $\Theta$ typically consists of a finite number $k$ of pattern classes: $\Theta = \{\theta_1, \dots, \theta_k\}$. Each class $\theta_i$ is associated with a probability distribution $P_{\theta_i}$ on $\mathcal X \times \mathcal Y$. The problem here is to decide, on the basis of the encoded informations $f_n(X^n)$ and $g_n(Y^n)$, which class the generated data $(X^n, Y^n)$ belongs to. The statistician tries to design the decoder $\psi_n$ so that the overall average classification error probability is minimized. This is called the problem of multiterminal pattern classification or the problem of distributed pattern classification.

On the other hand, we may also consider the case in which the parameter set $\Theta$ is an open subset of the $m$-dimensional Euclidean space and the joint distribution $P_\theta$ is a sufficiently smooth function of the parameter $\theta$. Then, an estimator for the parameter $\theta$ can be written as

  $\hat\theta_n = \psi_n(f_n(X^n), g_n(Y^n))$   (1.11)

where $(X^n, Y^n)$ is an independently and identically distributed random variable of blocklength $n$ subject to the joint probability distribution $P_\theta$ on $\mathcal X \times \mathcal Y$. In this case, our purpose is, given rate constraint (1.3), to design a statistical inference system so as to make the (co)variance of $\hat\theta_n$ as small as possible, where $\hat\theta_n$ is requested in most cases to be asymptotically unbiased. Usually, the (co)variance of an effective estimator $\hat\theta_n$ is asymptotically proportional to the inverse of the blocklength $n$. Thus we are led to formulate and study another kind of problem, that is, the problem of multiterminal parameter estimation, sometimes also called the problem of distributed estimation.

These three areas of multiterminal hypothesis testing, multiterminal classification, and multiterminal parameter estimation are closely related to each other, and they have a nice structural correspondence not only at the conceptual level but also at the technical level, where the notion of divergence (or Fisher information) plays a key role in bridging these areas.

It should be kept in mind that the design of the encoders and the decoder defined above should not depend on the actual value of the parameter $\theta$, because the actual parameter is unknown to both encoders and decoder and is to be evaluated at the decoder based on the encoded data. This means that only universal codings make sense in these multiterminal statistical inference problems.

In the present paper we deal with the problem of multiterminal statistical inference thus defined. First, in the next section, we give a brief historical sketch of this subject, and then, in the subsequent sections, summarize several main results that have been established in this research field up to the present. It should be mentioned here that, to the best of our knowledge, until now only very few papers have been published in this field.

This would be mainly because the problem under consideration is, in general, of formidable complexity in its own nature (cf. Ahlswede and Csiszár [3]), and so it scarcely allows us to reach the so-called single-letter characterization for achievable error exponents in multiterminal hypothesis testing and multiterminal pattern classification, or that for achievable covariances in multiterminal parameter estimation, where the term "single-letter characterization" is used to denote the computability of the relevant quantity. This means that this research field is not yet mature enough and remains to be further cultivated. Thus, for the sake of further possible developments of this field, it would be useful to try here to summarize several typical earlier results in as compact a form as possible, but not in an ambiguous manner. Sections II–IX are assigned to this purpose, where Section IX also includes new results on the optimality concerning the zero-rate multiterminal pattern classification. Finally, in Sections X and XI, we present the solutions to two open problems on the optimality (i.e., the converse part) concerning both hypothesis testing and parameter estimation under zero-rate multiterminal data compression. Specifically, in these sections we prove that both the zero-rate acceptance region given by Han and Kobayashi [5] and the zero-rate parameter estimator given by Amari [13] are optimal. The proof for the former problem is rather simple, whereas that for the latter is much harder and needs rather subtle large-deviation techniques as well as basic information-geometrical considerations. As a by-product, the optimality of the zero-rate pattern classifier of Amari and Han [9] is newly derived (cf. Section IX).

II. HISTORICAL SKETCH

The problem of multiterminal hypothesis testing with constant-type constraint $\alpha_n \le \epsilon$ was first investigated in 1986, seven years after the original proposal of Berger [1], by Ahlswede and Csiszár [3]. They mainly focused on the problem of hypothesis testing against independence in the case with arbitrary positive rate $R_1 = R$ but with full side-information $R_2 = \infty$, and succeeded in establishing the single-letter characterization of the optimal exponent for the error probability of the second kind as a function of the rate $R$. They also gave a single-letter lower bound (though not very tight) on the optimal exponent for the general hypothesis testing with $R_2 = \infty$. From the technical point of view, the approach adopted by them to investigate this problem was called the divergence characterization problem.

Subsequently, in 1987, Han [4] studied the problem of more general hypothesis testing with constant-type constraint at arbitrary positive rates $(R_1, R_2)$ and derived a single-letter lower bound on the optimal error exponent by reducing the problem to that of minimizing the relevant divergence, which, with reference to the case of $R_2 = \infty$, is much tighter than that of Ahlswede and Csiszár [3]. In particular, Han first introduced the problem of hypothesis testing with one-bit data compression and established the single-letter characterization of the optimal error exponent for this one-bit compression system, where it was revealed that the exponent is still generally positive even with only one bit of information.


In 1989, Han and Kobayashi [5] considered the same hypothesis testing problem with exponential-type constraint $\alpha_n \le e^{-nr}$ ($r > 0$ is any constant) to obtain results paralleling those in Han [4], including also the case of general zero-rate data compression in addition to the case of one-bit compression.

In 1992, Shalaby and Papamarcou [6] refined the result of Han [4] for the one-bit compression system and extended it to the case of general zero-rate compression, where it is interesting to see that the converse part was proved by an ingenious use of the "blowing-up lemma" of Ahlswede, Gács, and Körner [7] (also, see Marton [8]). It will turn out that this lemma plays a key role also in establishing the converse part for the zero-rate hypothesis testing problem with exponential-type constraint (see Section V) as well as that for the zero-rate parameter estimation problem (see Section VII). Shalaby and Papamarcou [6] also studied the composite hypothesis testing problem under zero-rate data compression.

In 1989, Amari and Han [9] presented an information-geometrical approach to the zero-rate hypothesis testing problem to demonstrate that the Pythagorean theorem on divergences plays a crucial role in formulating and studying this kind of problem. They also studied the zero-rate multiterminal statistical classification problem. In 1994, Shalaby and Papamarcou [10] generalized the zero-rate hypothesis testing problem to the case where the source is a correlated Markov process, obtaining several non-single-letterized results paralleling those in their previous paper [6].

In 1994, by taking into consideration for the first time also the aspect of coding error probabilities, Han, Shimokawa, and Amari [11] established a significantly tighter single-letter lower bound on the optimal error exponent for multiterminal hypothesis testing with full side-information at arbitrary positive rate $R_1 = R$, where they also showed that their lower bound coincides with the optimal error exponent at higher rates $R$.

Now let us turn to the multiterminal parameter estimation problem. The first result in this area was given in 1988 by Zhang and Berger [12], who considered the problem of parameter estimation with one-dimensional parameter $\theta$ at arbitrary positive rates $(R_1, R_2)$ and demonstrated the existence of an asymptotically unbiased estimator $\hat\theta_{ZB}$, establishing a single-letter upper bound on the minimum variance that can be attained by the optimal estimator. Their result, however, was very restrictive, because in doing so they imposed a stringent condition, called the additivity condition, on the family $\{P_\theta\}$ of joint distributions, a property that is not preserved under parameter transformations.

On the other hand, from the information-geometrical point of view, in 1989, Amari [13] studied the parameter estimation problem under zero-rate data compression, where he constructed a very simple asymptotically unbiased effective estimator $\hat\theta_A$ by using only the marginal types $\hat P_{X^n}$ and $\hat P_{Y^n}$ of the generated data, together with the explicit form for the covariance of $\hat\theta_A$. It will be shown in Section XI that the estimator $\hat\theta_A$ has the minimum achievable covariance under zero-rate data compression.


It should be pointed out that the estimator $\hat\theta_A$ of Amari was derived by solving the maximum-likelihood equation based on the calculation of the joint probability of the statistic

  $(\hat P_{X^n}, \hat P_{Y^n})$   (2.1)

so that the covariance of $\hat\theta_A$ asymptotically coincides with the inverse of the Fisher information matrix of (2.1).

Subsequently, in 1990, Ahlswede and Burnashev [14] considered the minimax estimation problem for a special but important case with one-dimensional parameter $\theta$ and full side-information $R_2 = \infty$ such that the marginal $P_\theta(x)$ of the joint probability $P_\theta(x, y)$ is independent of the parameter $\theta$. This assumption was used to avoid the technical difficulties due to universal coding mentioned in Section I; that is, one can dispense with universal coding techniques owing to the assumed independence of $P_\theta(x)$ from $\theta$. With this special multiterminal estimation system they established a limiting (not single-letterized) formula for the optimal minimax variance index as a function of the rate $R$, which states that the optimal minimax variance index is given by the inverse of the related maxmin Fisher information index.

In 1995, Han and Amari [15] attempted to generalize the zero-rate estimation scheme of Amari [13] so as to make it applicable to the general multiterminal estimation system (with a multidimensional parameter $\theta$) working at arbitrary positive rates $(R_1, R_2)$. Like Zhang and Berger [12], they introduced auxiliary random variables $U, V$ such that $U \to X \to Y \to V$ forms a Markov chain in this order, and constructed an asymptotically unbiased effective estimator $\hat\theta_{HA}$ using only the joint types of $(u^n, x^n)$ and $(y^n, v^n)$, where $u^n, v^n$ are the encoded versions of the auxiliary data generated, given data $x^n, y^n$, according to the conditional joint probabilities. This estimator was derived, as in Amari [13], by solving the maximum-likelihood equation based on the calculation of the joint probability of the statistic

  $(\hat P_{u^n x^n}, \hat P_{y^n v^n})$   (2.2)

so that the covariance matrix of $\hat\theta_{HA}$ asymptotically coincides with the inverse of the Fisher information matrix of (2.2). The form of the covariance of $\hat\theta_{HA}$ thus constructed is invariant under parameter transformations. It was also shown that the variance of the estimator $\hat\theta_{HA}$, specialized to the one-dimensional parameter case, is in general substantially smaller than that of the estimator $\hat\theta_{ZB}$ of Zhang and Berger [12] when both estimation systems are working at the same rates $(R_1, R_2)$. However, the covariance matrix of $\hat\theta_{HA}$ does not seem to be the best possible in general at arbitrarily given rates $(R_1, R_2)$.

Thus far, we have briefly summarized "almost all" of the presently existing results in the fields of multiterminal hypothesis testing, multiterminal classification, and multiterminal parameter estimation. However, are they all? Why is that? Although we can say that the one-bit and/or zero-rate compression case has been completely solved (cf. Han [4], Shalaby and Papamarcou [6]) or is to be solved in Sections IX–XI of the present paper, problems in the general positive-rate compression case are very intractable and many of them still remain open, except for the only special case of hypothesis testing against independence with full side-information $R_2 = \infty$ (cf. Ahlswede and Csiszár [3]). Does this mean that the general positive-rate problem affords, in its own nature, no single-letter characterization at all? The approach of divergence characterization by Ahlswede and Csiszár [3] seems to be hopeless for establishing the direct part, whereas the information-geometrical approach by Han and Amari [15] seems to be insufficient for establishing the converse part. There still exists a substantial gap between these two kinds of approaches. In this connection, we would like to remind the reader that the present problem lies in the void between statistics and information theory, and so researchers in this field need to be knowledgeable about the basic elements of both statistics and information theory. In this sense, we cannot say that enough effort has already been devoted to investigating this subject. In addition to these difficulties, it is very likely that the subtle problem of ancillary statistics (cf. Rao [16]) may be one of the main obstacles to the final goal (also, cf. Körner and Marton [17], Han and Kobayashi [18]). Are these difficulties to be overcome in the near future? We believe that there exists a way of eluding them. Anyway, nobody knows the future for certain!

So far we have briefly summarized several information-theoretic approaches to the problem of statistical inference under multiterminal data compression. It should also be mentioned that in the area of communication theory a bulk of related works has already been accumulated, based on various rather non-information-theoretic approaches; they include, for example, [25]–[47]. In the present paper, however, we have no space to review these works.

III. SINGLE-USER SYSTEM FOR STATISTICAL INFERENCE

Before going into the details of the problem of multiterminal statistical inference, it would be helpful for the reader first to think about the single-user system of statistical inference in order to get some preliminary insight into this kind of problem. The single-user statistical inference system is formulated as follows (Fig. 2). Let $\mathcal X$ be a finite set. Given a family $\{P_\theta : \theta \in \Theta\}$ of probability distributions on $\mathcal X$, suppose that i.i.d. data $X^n = (X_1, \dots, X_n)$ were generated at site A according to the probability $P_\theta$ with an unknown parameter $\theta$. The encoder $f_n$ at site A maps the data $X^n$ into the form $f_n(X^n)$, and then the decoder $\psi_n$ at site C maps the encoded data $f_n(X^n)$ into the form $\psi_n(f_n(X^n))$.

Fig. 2. Single-user system for statistical inference.

First, let us consider the problem of hypothesis testing, i.e., the case where $\Theta = \{\theta_1, \theta_2\}$. Let $\mathcal A_n^*$ be the Neyman–Pearson acceptance region available when the data $x^n$ is fully observed by the statistician, and define the encoder and the decoder by

  $f_n(x^n) = 1$ for $x^n \in \mathcal A_n^*$,  $f_n(x^n) = 0$ for $x^n \notin \mathcal A_n^*$   (3.1)
  $\psi_n(1) = \theta_1$,  $\psi_n(0) = \theta_2$.   (3.2)

With this coding scheme the center at site C can exactly recognize whether the data $x^n$ is in $\mathcal A_n^*$ or not. This means that one-bit information, i.e., $M_n = 2$, is enough to achieve the same optimal hypothesis testing as in the usual situation in which the data $X^n$ is fully available.

The pattern classification problem assumes that there are $k$ pattern classes, each generating data subject to a probability distribution $P_{\theta_i}$ ($i = 1, \dots, k$) on $\mathcal X$. Hence $\Theta = \{\theta_1, \dots, \theta_k\}$, and the problem is to classify the data $X^n$ into one of the pattern classes so that the average classification error probability is minimized. The statistician can make the optimal Bayes classification based on $X^n$. Define the encoder and the decoder by

  $f_n(x^n) = i$ when the Bayes decision for $x^n$ is $\theta_i$   (3.3)
  $\psi_n(i) = \theta_i$.   (3.4)

This means that $\log k$ information, irrespective of $n$, is sufficient to achieve the optimal pattern classification. Thus the zero-rate classification achieves the same optimal performance as in the case with full data available.

Next, consider the problem of parameter estimation. We define the encoder $f_n$ by

  $f_n(x^n) = \hat P_{x^n}$ (the type of $x^n$)   (3.5)

and the decoder $\psi_n$ maps the encoded data into the maximum-likelihood estimator $\hat\theta_n$. Then, the coding system thus defined achieves the same optimal parameter estimation as in the usual situation in which the data $X^n$ is fully available, because the type $\hat P_{x^n}$ is a sufficient statistic for the parameter $\theta$. Since the number of different types is at most $(n+1)^{|\mathcal X|}$ (cf. Csiszár and Körner [19]), the rate constraint

  $(1/n) \log \|f_n\| \le |\mathcal X| \log(n+1)/n$   (3.6)

is enough to attain this coding scheme. Noting that (3.6) means the zero-rate data compression

  $\lim_{n\to\infty} (1/n) \log \|f_n\| = 0$   (3.7)

we conclude that zero-rate information is enough to achieve the optimal estimation with full data available.

Thus in all these single-user cases of hypothesis testing, pattern classification, and parameter estimation, zero-rate information always guarantees the same optimal performance as in the case with unconstrained rate. As a consequence, the single-user statistical inference system is really trivial and hence entirely uninteresting to us. As will be seen in the subsequent sections, however, the situation drastically changes once we consider the multiterminal system, although it will turn out that even in the multiterminal case the system immediately decomposes into two trivial single-user systems if the generation of the data $X^n$ and $Y^n$ is statistically independent.
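The following sketch illustrates the single-user argument around (3.5)–(3.7): the number of types grows only polynomially in $n$, so describing the type takes vanishing rate, while the type is sufficient for maximum-likelihood estimation. The Bernoulli family used here is an assumption for illustration.

```python
import numpy as np
from math import comb, log2

n, alphabet = 1000, 2

# Number of types of sequences in {0,1}^n is n+1 <= (n+1)^{|X|}, cf. (3.6).
num_types = comb(n + alphabet - 1, alphabet - 1)
rate = log2(num_types) / n
print(f"rate needed to send the type: {rate:.5f} bits/symbol -> 0 as n grows")  # (3.7)

# The type is a sufficient statistic: ML estimation for Bernoulli(theta) needs only it.
rng = np.random.default_rng(1)
x = rng.random(n) < 0.3
type_of_x = np.mean(x)          # encoder (3.5): transmit the empirical distribution
theta_ml = type_of_x            # decoder: ML estimate computed from the type alone
print(f"ML estimate from the type: {theta_ml:.4f}")
```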


IV. MULTITERMINAL HYPOTHESIS TESTING AT POSITIVE RATES

In this section we describe several — indeed "almost all" — known results in the field of multiterminal hypothesis testing with positive rates $(R_1, R_2)$. With the multiterminal coding system of Section I in mind, we first give the formal statement of the problem. Let us define two integer sets by

  $\mathcal M_n = \{1, 2, \dots, M_n\}$,  $\mathcal L_n = \{1, 2, \dots, L_n\}$.

Encoder $f_n: \mathcal X^n \to \mathcal M_n$ at site A maps each element $x^n \in \mathcal X^n$ to the corresponding element $f_n(x^n) \in \mathcal M_n$. Similarly, encoder $g_n: \mathcal Y^n \to \mathcal L_n$ at site B maps each element $y^n \in \mathcal Y^n$ to the corresponding element $g_n(y^n) \in \mathcal L_n$. On the other hand, the decoder $\psi_n: \mathcal M_n \times \mathcal L_n \to \{H_0, H_1\}$ at site C maps each element $(i, j) \in \mathcal M_n \times \mathcal L_n$ to the corresponding element $\psi_n(i, j)$. The region

  $\mathcal A_n = \{(x^n, y^n) : \psi_n(f_n(x^n), g_n(y^n)) = H_0\}$   (4.1)

is called the acceptance region. Define the error probabilities $\alpha_n, \beta_n$ of the first and the second kind as in (1.7) and (1.8) of Section I. The rate constraints as in (1.3) of Section I are more formally written as

  $\limsup_{n\to\infty} (1/n) \log M_n \le R_1$   (4.2)
  $\limsup_{n\to\infty} (1/n) \log L_n \le R_2$.   (4.3)

We first impose the constant-type constraint on $\alpha_n$ as

  $\alpha_n \le \epsilon$   (4.4)

where $0 < \epsilon < 1$ is an arbitrarily fixed constant. Let $\beta_n^*(\epsilon \mid R_1, R_2)$ denote the minimum of $\beta_n$ over all possible encoders $f_n, g_n$ and decoders $\psi_n$ satisfying conditions (4.2)–(4.4), and define

  $\theta(\epsilon \mid R_1, R_2) = \liminf_{n\to\infty} -(1/n) \log \beta_n^*(\epsilon \mid R_1, R_2)$   (4.5)

which is called the optimal error exponent for the hypothesis testing. The definition (4.5) is asymptotically equivalent to $\beta_n^*(\epsilon \mid R_1, R_2) = \exp[-n(\theta(\epsilon \mid R_1, R_2) + o(1))]$.

The final goal of multiterminal hypothesis testing is to completely determine the value of $\theta(\epsilon \mid R_1, R_2)$ as a function of $R_1$ and $R_2$. In addition, if possible, we want to attain the single-letter characterization for $\theta(\epsilon \mid R_1, R_2)$. However, this does not hold in most cases, and so in general we cannot help being satisfied with reasonably "good" lower (and/or upper) bounds on $\theta(\epsilon \mid R_1, R_2)$. Throughout the present paper we use the convention that $P_X$ denotes the probability distribution of a random variable $X$ and $P_{X|Y}$ denotes the conditional probability distribution of a random variable $X$ given a random variable $Y$.

The first result concerns the full side-information case (indicated by $R_2 = \infty$). This means that the decoder at site C can fully observe the data $Y^n$ generated at site B. Let us now consider the following hypothesis testing with

  $H_0: P_{XY}$  (null hypothesis)   (4.6)
  $H_1: \bar P_{XY} = P_X \times P_Y$  (alternative hypothesis)   (4.7)

where $P_{XY}$ denotes a joint probability distribution on $\mathcal X \times \mathcal Y$, and $P_X \times P_Y$ denotes the product probability measure, i.e., $P_X$ and $P_Y$ are the marginals of $P_{XY}$. This kind of hypothesis testing is called testing against independence. Let

  $\theta_{AC}(R) = \max\{\,I(U; Y) : U \to X \to Y,\ I(U; X) \le R,\ |\mathcal U| \le |\mathcal X| + 1\,\}$   (4.8)

where $U \to X \to Y$ means that $U, X, Y$ form a Markov chain in this order, $I(\cdot\,; \cdot)$ is the mutual information (e.g., cf. Cover and Thomas [20]), and $|\mathcal U|$ denotes the number of values taken by the random variable $U$. Then, we have

Theorem 4.1 (Ahlswede and Csiszár [3]): Consider the hypothesis testing with (4.6) and (4.7). Then, for all $0 < \epsilon < 1$ and all $R \ge 0$

  $\theta(\epsilon \mid R, \infty) = \theta_{AC}(R)$.   (4.9)

Remark 4.1: The point in the derivation of Theorem 4.1 is to reproduce at the decoder the joint type $\hat P_{u^n y^n}$, where $u^n$ is the encoded version of the auxiliary data generated, given data $x^n$, according to the conditional probability $P_{U|X}$. We would like to make the following remark. If $R \ge H(X)$ (the entropy of $X$), we can set $U = X$ in (4.8), and hence we have

  $\theta(\epsilon \mid R, \infty) = I(X; Y)$   (4.10)

for all $0 < \epsilon < 1$ and $R \ge H(X)$. The right-hand side of (4.10) is nothing but the same optimal error exponent as in the case where the data $X^n$ is fully available, as is well known in the field of statistics.
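A minimal numerical sketch of the quantity maximized in (4.8): for a fixed test channel $P_{U|X}$ with $U \to X \to Y$, any $U$ with $I(U;X) \le R$ makes $I(U;Y)$ an achievable exponent for testing against independence. The joint distribution and the channel below are hypothetical.

```python
import numpy as np

def mutual_info(pxy):
    """I(X;Y) in nats for a joint distribution given as a 2-D array."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])))

p_xy = np.array([[0.35, 0.15],
                 [0.10, 0.40]])            # null hypothesis P_XY (assumed)
p_u_given_x = np.array([[0.9, 0.1],        # test channel P_{U|X}: rows x, cols u
                        [0.2, 0.8]])

px = p_xy.sum(axis=1)
p_ux = p_u_given_x.T * px                  # joint of (U, X);  U -> X -> Y
p_uy = p_u_given_x.T @ p_xy                # joint of (U, Y) via the Markov chain

print(f"I(U;X) = {mutual_info(p_ux):.4f}  (must be <= R)")
print(f"I(U;Y) = {mutual_info(p_uy):.4f}  (achievable exponent, cf. (4.9))")
```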

Now, instead of (4.6) and (4.7), let us consider the general hypothesis testing with

  $H_0: P_{XY}$  (null hypothesis)   (4.11)
  $H_1: \bar P_{XY}$  (alternative hypothesis)   (4.12)

where $P_{XY}$ and $\bar P_{XY}$ are arbitrary joint distributions on $\mathcal X \times \mathcal Y$. In this general case we cannot necessarily have the exact single-letter formula for $\theta(\epsilon \mid R_1, R_2)$. However, it is possible to derive rather reasonable lower bounds on $\theta(\epsilon \mid R_1, R_2)$. Let $\mathcal S(R_1, R_2)$ denote the set of pairs $(U, V)$ of auxiliary random variables satisfying

  $U \to X \to Y$, $X \to Y \to V$, $I(U; X) \le R_1$, $I(V; Y) \le R_2$   (4.13)

and, for each $(U, V) \in \mathcal S(R_1, R_2)$, let $\Pi(U, V)$ denote the set of joint distributions $P_{\tilde U \tilde X \tilde Y \tilde V}$ whose $(\tilde U, \tilde X)$-marginal and $(\tilde Y, \tilde V)$-marginal agree with $P_{UX}$ and $P_{YV}$, respectively:

  $\Pi(U, V) = \{P_{\tilde U \tilde X \tilde Y \tilde V} : P_{\tilde U \tilde X} = P_{UX},\ P_{\tilde Y \tilde V} = P_{YV}\}$   (4.14)


and define

  $\sigma(R_1, R_2) = \max_{(U,V) \in \mathcal S(R_1, R_2)}\ \min_{\Pi(U,V)} D(P_{\tilde X \tilde Y} \| \bar P_{XY})$   (4.15)

where $D(\cdot \| \cdot)$ is the divergence and the random variables $\tilde X, \tilde Y$ are uniquely specified by the conditions (4.16) and (4.17). Then, we have the following theorem. The point in its derivation is to reproduce at the decoder the pair of joint types

  $(\hat P_{u^n x^n},\ \hat P_{y^n v^n})$   (4.18)

where $u^n, v^n$ are the encoded versions of the auxiliary data generated according to the conditional probabilities $P_{U|X}$ (given data $x^n$) and $P_{V|Y}$ (given data $y^n$), respectively.

Theorem 4.2 (Han [4]): For all $0 < \epsilon < 1$ and all $R_1 \ge 0$, $R_2 \ge 0$

  $\theta(\epsilon \mid R_1, R_2) \ge \sigma(R_1, R_2)$.   (4.19)

Corollary 4.1: For all $0 < \epsilon < 1$ and all $R_1 \ge H(X)$, $R_2 \ge H(Y)$

  $\theta(\epsilon \mid R_1, R_2) = D(P_{XY} \| \bar P_{XY})$.   (4.20)

Proof: If $R_1 \ge H(X)$ and $R_2 \ge H(Y)$, we can set $U = X$ and $V = Y$, because it then follows from (4.15) that

  $\sigma(R_1, R_2) = D(P_{XY} \| \bar P_{XY})$.   (4.21)

The right-hand side of (4.21) is the same optimal error exponent as in the case where the data $(X^n, Y^n)$ are fully available (Stein's lemma: e.g., cf. Cover and Thomas [20]). Hence, (4.19) yields (4.20).

As a special case of Theorem 4.2 we may consider the full side-information case $R_2 = \infty$. In order to describe it, define the quantities (4.22) and (4.23), where the auxiliary random variable $U$ is as in (4.8) and the random variable $\tilde X$ is uniquely specified by the condition (4.24). Then, Theorem 4.2 yields

Corollary 4.2: For all $0 < \epsilon < 1$ and all $R \ge 0$

  $\theta(\epsilon \mid R, \infty) \ge \sigma(R, \infty)$.   (4.25)

Remark 4.2: The lower bound in (4.25) is tighter than that of Ahlswede and Csiszár [3] with full side-information $R_2 = \infty$.

Remark 4.3: In the special case of testing against independence, (4.25) reduces to

  $\theta(\epsilon \mid R, \infty) \ge \theta_{AC}(R)$.   (4.26)

In view of (4.9), this means that the lower bound (4.25) is tight in this case.

It is obvious that if we consider the case of full side-information with $R_1 \ge H(X)$, then Corollary 4.1 boils down to

Corollary 4.3: For all $0 < \epsilon < 1$ and all $R \ge H(X)$

  $\theta(\epsilon \mid R, \infty) = D(P_{XY} \| \bar P_{XY})$.   (4.27)

Corollary 4.2 can be pushed further, to Theorem 4.3 below, as follows. All the results stated up to now in this section have been established by using encoding schemes such that the statistic (4.18) is exactly reproduced at the decoder with zero error probability. It is also possible, however, to consider a wider class of encoders that afford exponentially decaying nonzero error probabilities. If we incorporate this wider class of encoding schemes, we need to simultaneously take into consideration the tradeoff between two kinds of error probabilities: one due to the encoding error in specifying an acceptance region at the decoder, and the other due to the error of hypothesis testing given an acceptance region. Then, at the expense of coding error probabilities, the rates needed for effective hypothesis testing can be reduced to a considerable extent, and so it is expected that the exponent could be made significantly larger. This is the basic idea underlying Theorem 4.3 below.

In order to describe Theorem 4.3, define the quantities (4.28) and (4.29), where $I(\cdot\,; \cdot \mid \cdot)$ denotes conditional mutual information and $H(\cdot \mid \cdot)$ denotes conditional entropy. Furthermore, with the definitions of (4.22) and (4.24) in mind, define the testing-error exponent (4.30), the auxiliary quantity (4.31) (defined casewise, with one value if the rate condition holds and another otherwise), and the coding-error exponent (4.32). Moreover, define the combined exponent $\sigma_{HSA}(R)$ in (4.33) as the minimum of the exponents (4.30) and (4.32);


then, we have

Theorem 4.3 (Han, Shimokawa, and Amari [11]): For all $0 < \epsilon < 1$ and all $R \ge 0$

  $\theta(\epsilon \mid R, \infty) \ge \sigma_{HSA}(R)$.   (4.34)

Remark 4.4: The exponent (4.30) is due to the error in hypothesis testing with an acceptance region identified, whereas the exponent (4.32) is due to the error in identifying the acceptance region. It is easy to see from (4.8) and (4.28) that the two exponents are directly comparable, and we also see from (4.30) and (4.32) that the coding-error exponent becomes inactive if the rate is sufficiently large. It then follows from (4.33) that $\sigma_{HSA}(R) \ge \sigma(R, \infty)$, which means that the lower bound in Theorem 4.3 is significantly tighter than the lower bound in Theorem 4.2. One reason is that the constraint $I(U; X) \le R$ is replaced by the weaker constraint $I(U; X) - I(U; Y) \le R$. It is possible to generalize Theorem 4.3 so as to hold also in the case with general positive rates $(R_1, R_2)$.

Remark 4.5: Consider a rate $R$ such that

  $H(X \mid Y) \le R < H(X)$.   (4.35)

Then we can set $U = X$ in (4.30)–(4.33), so that the testing-error exponent reduces to $D(P_{XY} \| \bar P_{XY})$ and only the coding-error exponent remains to be accounted for. Hence, for rates satisfying (4.35), it is concluded that the bound (4.36) holds. Since (4.35) implies (4.37), expression (4.36), with rates in (4.35), is an improvement over Corollary 4.3, provided that the minimum in (4.36) is attained by $D(P_{XY} \| \bar P_{XY})$; this case takes place when the alternative hypothesis $\bar P_{XY}$ is not very far from the null hypothesis $P_{XY}$.

V. ZERO-RATE MULTITERMINAL HYPOTHESIS TESTING

In this section let us consider the zero-rate compression case. We define the zero-rate compression by

  $\lim_{n\to\infty} (1/n) \log M_n = 0$   (5.1)
  $\lim_{n\to\infty} (1/n) \log L_n = 0$.   (5.2)

Each of (5.1) and (5.2) is indicated by $R_1 = 0$ and $R_2 = 0$, respectively.

First, as a special case of the zero-rate compression problem, we address the one-bit compression case. The one-bit compression is defined by

  $M_n = 2$   (5.3)
  $L_n = 2$   (5.4)

and each of (5.3) and (5.4) is indicated by $R_1 = 0_2$ and $R_2 = 0_2$, respectively. Define

  $\sigma(0, 0) = \min\{D(\tilde P_{XY} \| \bar P_{XY}) : \tilde P_X = P_X,\ \tilde P_Y = P_Y\}$   (5.5)

where the minimum is taken over all joint distributions $\tilde P_{XY}$ on $\mathcal X \times \mathcal Y$ whose marginals coincide with those of the null hypothesis $P_{XY}$ (Fig. 3); then we have

Fig. 3. Minimization: θ(ε | 0, R) for the zero-rate hypothesis testing.

Theorem 5.1 (Han [4]): For all $0 < \epsilon < 1$

  $\theta(\epsilon \mid 0_2, 0_2) \ge \sigma(0, 0)$.   (5.6)

Theorem 5.2 (Han [4]): Suppose that $D(P_{XY} \| \bar P_{XY}) < \infty$ holds; then there exists some constant $\epsilon_0 > 0$ such that, for all $0 < \epsilon < \epsilon_0$ and all encoders and decoders satisfying the zero-rate constraints (5.1) and (5.2),

  $\theta(\epsilon \mid 0, 0) \le \sigma(0, 0)$.   (5.7)

Combining Theorem 5.1 with Theorem 5.2 immediately yields

Corollary 5.1: Suppose that $D(P_{XY} \| \bar P_{XY}) < \infty$ holds; then there exists some constant $\epsilon_0 > 0$ such that for all $0 < \epsilon < \epsilon_0$

  $\theta(\epsilon \mid 0, 0) = \theta(\epsilon \mid 0_2, 0_2) = \sigma(0, 0)$.   (5.8)

Remark 5.1: Theorem 5.1 tells us that only one-bit information about the generated data $x^n$ and $y^n$ is enough to attain a positive error exponent, provided that $\sigma(0, 0) > 0$, i.e., provided that at least one marginal of $\bar P_{XY}$ differs from the corresponding marginal of $P_{XY}$. On the other hand, Theorem 5.2 tells us that no zero-rate scheme can exceed the exponent $\sigma(0, 0)$, provided that $D(P_{XY} \| \bar P_{XY}) < \infty$.
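A sketch of the minimization (5.5) defining $\sigma(0,0)$: minimize $D(\tilde P \| \bar P_{XY})$ over joint distributions $\tilde P$ whose marginals equal those of $P_{XY}$. The two hypotheses below are assumed for illustration; the problem is convex, so a generic solver suffices.

```python
import numpy as np
from scipy.optimize import minimize

p  = np.array([[0.40, 0.10], [0.10, 0.40]])   # null P_XY (assumed)
pb = np.array([[0.30, 0.20], [0.25, 0.25]])   # alternative \bar P_XY (assumed)

def objective(q_flat):
    q = q_flat.reshape(2, 2)
    return np.sum(np.where(q > 1e-12, q * np.log(q / pb), 0.0))

constraints = (
    {"type": "eq", "fun": lambda q: q.reshape(2, 2).sum(axis=1) - p.sum(axis=1)},  # X-marginal
    {"type": "eq", "fun": lambda q: q.reshape(2, 2).sum(axis=0) - p.sum(axis=0)},  # Y-marginal
)
res = minimize(objective, x0=np.full(4, 0.25), bounds=[(1e-9, 1)] * 4,
               constraints=constraints, method="SLSQP")
print(f"sigma(0,0) = {res.fun:.5f}")          # the zero-rate exponent of Theorem 5.1
print(res.x.reshape(2, 2))                    # the minimizing \tilde P
```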


Shalaby and Papamarcou [6] have shown a variant of Theorem 5.2 with zero rate $R_1 = R_2 = 0$, which is stated as follows.

Theorem 5.3 (Shalaby and Papamarcou [6]): Assume that the positivity condition

  $\bar P_{XY}(x, y) > 0$ for all $(x, y) \in \mathcal X \times \mathcal Y$   (5.9)

is satisfied. Then, for all $0 < \epsilon < 1$ and all zero-rate encoders

  $\theta(\epsilon \mid 0, 0) \le \sigma(0, 0)$.   (5.10)

Combining Theorem 5.1 with Theorem 5.3 immediately yields

Corollary 5.2: Assume that the positivity condition (5.9) is satisfied. Then, for all $0 < \epsilon < 1$ and all zero-rate encoders

  $\theta(\epsilon \mid 0, 0) = \sigma(0, 0)$.   (5.11)

Remark 5.2: The condition $D(P_{XY} \| \bar P_{XY}) < \infty$ in Theorem 5.2 is replaced by the stronger condition (5.9) in Theorem 5.3, while the condition $0 < \epsilon < \epsilon_0$ in Theorem 5.2 is relaxed to $0 < \epsilon < 1$ in Theorem 5.3. Thus neither Theorem 5.2 nor Theorem 5.3 subsumes the other. The positivity condition (5.9) is needed in order to invoke the "blowing-up lemma" [7] in the proof of Theorem 5.3.

We now consider the zero-rate hypothesis testing problem with the exponential-type constraint on the error probability of the first kind as follows:

  $\alpha_n \le e^{-nr}$   (5.12)

where $r > 0$ is an arbitrary fixed constant. This constraint is asymptotically equivalent to requiring that $\alpha_n$ decay with exponent at least $r$. Let $\beta_n^*(r \| R_1, R_2)$ denote the minimum of the error probability $\beta_n$ of the second kind over all possible encoders $f_n, g_n$ and decoders $\psi_n$ satisfying conditions (4.2), (4.3), and (5.12), and define the optimal error exponent by

  $\sigma(r \| R_1, R_2) = \liminf_{n\to\infty} -(1/n) \log \beta_n^*(r \| R_1, R_2)$.   (5.13)

In the sequel we demonstrate several established results on the single-letter characterization of $\sigma(r \| R_1, R_2)$ in the zero-rate compression case. Defining the set

  $\Lambda(r) = \{\tilde P_{XY} : \tilde P_X = \hat P_X,\ \tilde P_Y = \hat P_Y$ for some $\hat P_{XY}$ with $D(\hat P_{XY} \| P_{XY}) \le r\}$   (5.14)

and

  $\sigma_{HK}(r) = \min\{D(\tilde P_{XY} \| \bar P_{XY}) : \tilde P_{XY} \in \Lambda(r)\}$   (5.15)

we have the first result as follows (Fig. 4).

Fig. 4. Minimization: σ(r ‖ 0, R) for the zero-rate hypothesis testing.

Theorem 5.4 (Han and Kobayashi [5]): For all $r > 0$

  $\sigma(r \| 0_2, 0_2) \ge \sigma_{HK}(r)$.   (5.16)

The converse counterpart of Theorem 5.4 is novel, and is stated as

Theorem 5.5: Assume that positivity condition (5.9) is satisfied. Then, for all $r > 0$ and all zero-rate encoders

  $\sigma(r \| 0, 0) \le \sigma_{HK}(r)$.   (5.17)

Proof: The proof is given later in Section X.

A consequence of Theorems 5.4 and 5.5 is

Corollary 5.3: Assume that positivity condition (5.9) is satisfied. Then, for all $r > 0$

  $\sigma(r \| 0, 0) = \sigma(r \| 0_2, 0_2) = \sigma_{HK}(r)$.   (5.18)

Let us now proceed to the one-bit compression problem. In this case, the statement of the result is a little more complicated. Define the quantities (5.19)–(5.21) and, based on them, the exponent (5.22), where the set $\Lambda(r)$ specified in (5.14) again intervenes. Then, we have

Theorem 5.6 (Han and Kobayashi [5]): Suppose that the relevant regularity condition of [5] holds. Then, for all $r > 0$, $\sigma(r \| 0_2, 0_2)$ is given by the single-letter characterization (5.23).

Remark 5.3: On the right-hand side of (5.23), neither of the two competing terms dominates the other in general. Generally speaking, one is the smaller for small $r$ and the other for large $r$ (Fig. 5).
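The one-bit schemes above can be mimicked by a typicality test on marginal types: each encoder sends a single bit telling whether its empirical marginal is within $\delta$ of the corresponding marginal of the null, and the decoder accepts $H_0$ only if both bits are favorable. This is a simplified stand-in for the actual Han–Kobayashi construction, with all numbers chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample(pxy, n):
    """Draw n i.i.d. pairs from a joint distribution on {0,1} x {0,1}."""
    flat = rng.choice(4, size=n, p=pxy.ravel())
    return flat // 2, flat % 2

def one_bit_test(x, y, px1, py1, delta):
    """Encoders send 1 iff their marginal type is delta-close to the null marginal."""
    bx = abs(np.mean(x) - px1) <= delta
    by = abs(np.mean(y) - py1) <= delta
    return bx and by                          # decoder accepts H0 iff both bits are 1

p  = np.array([[0.40, 0.10], [0.10, 0.40]])   # null: marginals (0.5, 0.5) (assumed)
pb = np.array([[0.09, 0.21], [0.21, 0.49]])   # alternative: marginals (0.3, 0.7) (assumed)

n, trials, delta = 400, 2000, 0.05
alpha = np.mean([not one_bit_test(*sample(p, n), 0.5, 0.5, delta) for _ in range(trials)])
beta  = np.mean([one_bit_test(*sample(pb, n), 0.5, 0.5, delta) for _ in range(trials)])
print(f"alpha_n ~ {alpha:.3f}, beta_n ~ {beta:.4f}  (beta decays exponentially in n)")
```

If the two hypotheses share both marginals, then $\sigma(0,0) = 0$ by (5.5), and no zero-rate scheme helps, which is consistent with this marginal-based test having no power in that case.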


Fig. 5. Minimization: σ(r ‖ 0_2, 0_2) for the one-bit hypothesis testing.

Remark 5.4: If we let $r \downarrow 0$ in (5.23), the right-hand side converges to $\sigma(0, 0)$, in accordance with Corollary 5.1.

VI. MULTITERMINAL PARAMETER ESTIMATION AT POSITIVE RATES

In this section we summarize some previously known results in the field of multiterminal parameter estimation with positive rates $(R_1, R_2)$. Suppose that a family

  $\{P_\theta(x, y) : \theta \in \Theta\}$   (6.1)

of joint distributions on $\mathcal X \times \mathcal Y$ is given, where $\Theta$ is an open subset of the $m$-dimensional Euclidean space and $P_\theta$ is assumed to be a twice continuously differentiable function of $\theta$. We consider the multiterminal data compression system with universal encoders $f_n, g_n$ and decoder $\psi_n$ as in Section IV. As in Section I, we denote by $(X^n, Y^n)$ i.i.d. random variables of blocklength $n$ subject to the joint probability distribution $P_\theta$. Then, in general, an estimator of the parameter $\theta$ at the decoder $\psi_n$ is written as in (1.11) of Section I, i.e.,

  $\hat\theta_n = \psi_n(f_n(X^n), g_n(Y^n))$.   (6.2)

If the encoders $f_n, g_n$ satisfy rate constraints (4.2), (4.3), then the corresponding estimator $\hat\theta_n$ is said to be $(R_1, R_2)$-achievable. Moreover, the estimator $\hat\theta_n$ is said to be asymptotically unbiased if

  $\lim_{n\to\infty} E_\theta[\hat\theta_n] = \theta$ for all $\theta \in \Theta$.   (6.3)

The asymptotic performance of an asymptotically unbiased estimator is measured by its covariance matrix

  $\Sigma_n(\theta) = E_\theta[(\hat\theta_n - E_\theta\hat\theta_n)(\hat\theta_n - E_\theta\hat\theta_n)^T]$   (6.4)

where "$T$" denotes transpose. Since it is common that the covariance $\Sigma_n(\theta)$ is asymptotically proportional to the inverse of the blocklength $n$, we define

  $v(\theta) = \limsup_{n\to\infty}\ n\,\Sigma_n(\theta)$   (6.5)

which we call the covariance index. Any covariance index $v(\theta)$ is said to be $(R_1, R_2)$-achievable if the corresponding estimator $\hat\theta_n$ is asymptotically unbiased and $(R_1, R_2)$-achievable. It should be noted here that $v(\theta)$ depends not only on $\theta$ but also on the sequence of encoders $f_n, g_n$ and the sequence of decoders $\psi_n$ designed so as to satisfy rate constraints (4.2) and (4.3). We want to make $v(\theta)$ as small as possible as long as rate constraints (4.2) and (4.3) are satisfied. Generally speaking, however, there does not exist a "uniformly optimal" coding scheme in the sense that $v(\theta)$ is the "minimum" for all $\theta \in \Theta$ and all coding schemes. For this reason, what matters here in general is whether a given covariance index $v(\theta)$ is $(R_1, R_2)$-achievable or not. This situation is in sharp contrast with that of the multiterminal hypothesis testing treated in Sections IV and V.

Remark 6.1: If an estimator $\hat\theta_n$ satisfies, instead of (6.3), the stronger condition

  $\lim_{n\to\infty}\ n\,(E_\theta[\hat\theta_n] - \theta) = 0$   (6.6)

then the estimator $\hat\theta_n$ is said to be strongly asymptotically unbiased.

Remark 6.2: Instead of (6.4), we may consider the mean-square error (MSE) matrix

  $M_n(\theta) = E_\theta[(\hat\theta_n - \theta)(\hat\theta_n - \theta)^T]$.   (6.7)

It is evident that the two matrices $\Sigma_n(\theta)$ and $M_n(\theta)$ are connected by the relation

  $M_n(\theta) = \Sigma_n(\theta) + (E_\theta\hat\theta_n - \theta)(E_\theta\hat\theta_n - \theta)^T$.   (6.8)

Here, instead of (6.5), we may define the MSE index as follows:

  $\bar v(\theta) = \limsup_{n\to\infty}\ n\,M_n(\theta)$.   (6.9)

In view of (6.8), the MSE index $\bar v(\theta)$ is in general larger than the covariance index $v(\theta)$, in the sense that $\bar v(\theta) - v(\theta)$ is nonnegative-definite. However, when the estimator $\hat\theta_n$ is strongly asymptotically unbiased in the sense of Remark 6.1, then, by means of (6.6) and (6.8), we have

  $\bar v(\theta) = v(\theta)$.   (6.10)

Thus, as far as we are concerned only with strongly asymptotically unbiased estimators, we can indifferently identify $\bar v(\theta)$ and $v(\theta)$.

We now state the estimator of Zhang and Berger. They consider the one-dimensional parameter case and make a very stringent assumption that there exists a function $F: \mathcal X \times \mathcal Y \to \mathbb R$ ($\mathbb R$ is the set of real values) such that

  $F(X, Y)$ is an unbiased estimator for the parameter $\theta$   (6.11)

(the additivity condition). Let $U, V$ be auxiliary random variables taking values in arbitrary finite sets $\mathcal U, \mathcal V$ and satisfying the following conditions:

a) $U \to X \to Y \to V$ is a Markov chain;   (6.12)
b) $P_{U|X}$ depends only on $X$, and $P_{V|Y}$ depends only on $Y$;   (6.13), (6.14)
c) there exists a function $G(U, V)$ such that $E_\theta[G(U, V)] = \theta$.   (6.15)

Then, we have

Theorem 6.1 (Zhang and Berger [12]): With any given rates $(R_1, R_2)$ there exists an $(R_1, R_2)$-achievable strongly asymptotically unbiased estimator $\hat\theta_{ZB}$ such that, for any $(U, V)$ satisfying the rate constraints

  $I(X; U) \le R_1$   (6.16)
  $I(Y; V) \le R_2$   (6.17)

the estimator $\hat\theta_{ZB}$ achieves the variance index $v_{ZB}(\theta)$ given by (6.18), where "Var" indicates the variance.

Remark 6.3: Equation (6.18) can be rewritten in a more compact form (cf. Han and Amari [15]) as (6.19). This expression does not contain the function $F$, which suggests that the role of the function $G$ is more substantial than that of the function $F$. We remark that (6.18) and (6.19) are not guaranteed for any $(U, V)$ that does not satisfy the rate constraints (6.16) and (6.17).

Remark 6.4: It should be noted that the stringent additivity condition (6.11) is not preserved under parameter transformations. Also, the right-hand side of either (6.18) or (6.19) is of somewhat peculiar form and does not seem to provide reasonable intuitive insight into further possible developments, and so it is very likely that $v_{ZB}(\theta)$ remains to be much improved. In fact, this is the case, as will be seen in the sequel.

Remark 6.5: Zhang and Berger [12] have also considered the multiterminal estimation problem for correlated Gaussian sources and shown several computations of (6.18) for this case. They have also investigated, again under the additivity condition, the asymptotic performance of the modified estimator that is a linear combination of the estimator $F$ and the marginal types.

Next, let us proceed to the estimator of Han and Amari. We sketch a rough outline of the process used to derive their estimator. First, like Zhang and Berger, they introduce auxiliary random variables $U, V$ with values in finite sets $\mathcal U, \mathcal V$, respectively, that satisfy conditions (6.12)–(6.14) alone. Then, we can write the corresponding joint distributions as in (6.20) and (6.21). We consider the following coding scheme. Suppose that data $(x^n, y^n)$ was generated at sites A and B. First, denote the conditional probabilities $P_{U|X}$ and $P_{V|Y}$ by $\omega_1, \omega_2$, respectively, and approximate them by fixed conditional types (cf. Csiszár and Körner [19]) so that the approximation errors vanish as in (6.22) and (6.23). Let us consider a pair $(U, V)$ that satisfies the rate conditions (6.24)–(6.26). Then, there exist (cf. Han and Amari [15] for the details) universal encoders $f_n, g_n$, under rate constraints (4.2) and (4.3), that satisfy the constraint conditions


(6.27) and (6.28), which require that the conditional types $\hat P_{u^n|x^n}$ and $\hat P_{v^n|y^n}$ of the transmitted auxiliary data stay asymptotically close to the chosen conditional types approximating $\omega_1$ and $\omega_2$; here the auxiliary data $u^n, v^n$ are generated according to those conditional probabilities given the data $x^n, y^n$, and $\hat P_{u^n|x^n}$ and $\hat P_{v^n|y^n}$ denote the conditional type of $u^n$ given $x^n$ and that of $v^n$ given $y^n$, respectively. In addition, the encoders $f_n, g_n$ can also transmit the joint types $\hat P_{u^n x^n}$ and $\hat P_{y^n v^n}$ to the decoder $\psi_n$, respectively, with zero rates. The decoder can thus compute the joint types $\hat P_{u^n x^n}$ and $\hat P_{y^n v^n}$. Let the joint probability of the statistic

  $(\hat P_{u^n x^n},\ \hat P_{y^n v^n})$   (6.29)

be denoted by $p_\theta(\cdot, \cdot)$. The decoder can then derive the maximum-likelihood equation and thereby construct the maximum-likelihood estimator $\hat\theta_{HA}$ by using the formula for the probability distribution of (6.29). Thus the whole problem boils down to how to compute the explicit expression for the probability of the statistic (6.29), although this task is extremely complicated from the technical point of view.

In order to write down the maximum-likelihood equation, we need some preparations. First, setting

  $\ell_\theta(u, x, y, v) = \log P_\theta(u, x, y, v)$   (6.30)

we define its "partial" derivative by (6.31), where "$\cdot$" denotes the derivative with respect to $\theta$. Let the $UX$-marginal, the $YV$-marginal, and the $XY$-marginal of $P_\theta(u, x, y, v)$ be denoted by $P_\theta(u, x)$, $P_\theta(y, v)$, and $P_\theta(x, y)$, respectively, and put the corresponding derivative vector (6.32). Similarly, we define (6.33), where the three terms are the $UX$-marginal, the $YV$-marginal, and the $XY$-marginal contributions, respectively. Furthermore, set (6.34). We notice that all of the quantities (6.31)–(6.34) are $m$-dimensional (row) vectors.

With these preparations, Han and Amari [15] have shown that the sought-after maximum-likelihood equation can be written asymptotically in the matrix form (6.35), where the (rectangular) projection matrix appearing there represents the structure of the asymptotic linear constraints (6.27) and (6.28) due to the encoding operation (for details, see [15]), and the positive-definite symmetric matrix $G_\theta$ appearing there is defined as follows. Set the index family (6.36); let the direct product over index subsets be defined in the usual way (for example, the index then runs over all the elements of the product set), and, for notational simplicity, denote the marginal of $P_\theta$ on any index subset by the corresponding subscript. Then, for any two index subsets $s, t$, the $(s, t)$ element of $G_\theta$ is defined by (6.37), where $\delta$ is Kronecker's delta; here, we restrict those subsets to the elements of the index set (6.38). It can then be checked that the dimension of $G_\theta$ is determined by the index set (6.38).

Finally, denoting by $\hat\theta_{HA}$ the estimator obtained by solving the maximum-likelihood equation (6.35), we have

Theorem 6.2 (Han and Amari [15]): With any given rates $(R_1, R_2)$, the $(R_1, R_2)$-achievable maximum-likelihood estimator $\hat\theta_{HA}$ is strongly asymptotically unbiased and, for any $(U, V)$ satisfying the rate constraints (6.24)–(6.26), the estimator achieves the following covariance index:

  $v_{HA}(\theta) = (G_\theta^*)^{-1}$   (6.39)

where $G_\theta^*$ is the matrix given by (6.40).


Remark 6.6: It can be checked (see [15]) that the $m$-dimensional matrix $G_\theta^*$ in (6.40) is the Fisher information index of the statistic defined in (6.34), where the Fisher information index is defined to be the limit of $(1/n)$ times the Fisher information matrix of that statistic. This means that the covariance index $v_{HA}(\theta)$ is the best possible as far as we use only the statistic (6.29), which is guaranteed by the Cramér–Rao bound (cf. Rao [16]).

Remark 6.7: It is obvious that the maximum-likelihood equation (6.35) has a form invariant under parameter transformations. Thus the form of the maximum-likelihood estimator $\hat\theta_{HA}$ systematically changes according to parameter transformations.

Remark 6.8: The estimator $\hat\theta_{ZB}$ of Zhang and Berger [12] under the additivity condition (6.11) is constructed on the basis of a partial use of the statistic (6.29), which together with Remark 6.6 ensures that the inequality

  $v_{HA}(\theta) \le v_{ZB}(\theta)$   (6.41)

always holds when, as in [12], the dimension of the parameter is specialized to $m = 1$. In addition, it should be noted here that conditions (6.24)–(6.26) in the case of Han and Amari imply conditions (6.16) and (6.17) in the case of Zhang and Berger. This means that, with the same rates $(R_1, R_2)$, there exists some nonempty subset of the parameter space on which the estimator $\hat\theta_{HA}$ is effective (i.e., has a finite covariance index) while the estimator $\hat\theta_{ZB}$ is not effective (i.e., its variance index is unbounded) for all parameters in that subset.

Remark 6.9: A geometrical interpretation of the above process of deriving the maximum-likelihood estimator $\hat\theta_{HA}$ is as follows (Fig. 6). Define the data space by

  $\mathcal D_n = \{(\hat P_{u^n x^n},\ \hat P_{y^n v^n})\}$   (6.42)

where $(\hat P_{u^n x^n}, \hat P_{y^n v^n})$ is the data type observed at the decoder. Consider minimizing the divergence $D(Q \| P_\theta)$ over all $Q$ consistent with the observed data types and all $\theta \in \Theta$ (double minimization), and let

  $\hat\theta = \arg\min_\theta\ \min_Q D(Q \| P_\theta)$.   (6.43)

Then it is possible to check, by using the Taylor expansion of the divergence around the observed point, that $\hat\theta$ asymptotically coincides with $\hat\theta_{HA}$. This is the geometrical interpretation of the estimator $\hat\theta_{HA}$ (cf. Remark 8.3).

Fig. 6. Double minimization: geometrical interpretation for the estimator θ̂_HA with positive rates.

In closing this section, we now describe the minimax estimator of Ahlswede and Burnashev [14]. As was mentioned in Section II, they considered the special case with one-dimensional parameter $\theta$ and full side-information $R_2 = \infty$ such that the marginal $P_\theta(x)$ of the joint probability $P_\theta(x, y)$ is independent of the parameter $\theta$. Here, the alphabets $\mathcal X$ and $\mathcal Y$ are not assumed to be finite. The assumption that the marginal is independent of $\theta$ is made to avoid such delicate arguments on universal coding as arise in the cases of Zhang and Berger [12] and Han and Amari [15]. Furthermore, in order to elude the subtle problem of uniform optimality, they took the minimax estimation approach. Since the marginal $P(x)$ is independent of $\theta$, the encoded version $U_n$ of an i.i.d. random variable $X^n$ of blocklength $n$ can be written as in (6.44), where $U_n$ denotes a random variable with values in an arbitrary finite set such that its conditional distribution does not depend on $\theta$; therefore, these random variables $U_n$ do not depend on $\theta$, since $P(x)$ does not depend on $\theta$. Denoting by $J_n(\theta, U_n)$ the Fisher information of the pair $(U_n, Y^n)$, define

  $J_n(R) = \max\{\ \min_{\theta \in \Theta} J_n(\theta, U_n)\ :\ U_n\ \text{of rate at most}\ R\ \}$   (6.45)
  $J^*(R) = \lim_{n\to\infty}\ (1/n)\,J_n(R).$   (6.46)

We call this the maxmin Fisher information index. With these definitions, Ahlswede and Burnashev have shown the following theorem under some regularity conditions, which gives the limiting (not single-letterized) formula for the optimal minimax variance index.

Theorem 6.3 (Ahlswede and Burnashev [14]):
1) For any $(R, \infty)$-achievable strongly asymptotically unbiased estimator $\hat\theta_n$

  $\liminf_{n\to\infty}\ \sup_{\theta \in \Theta}\ n\,E_\theta(\hat\theta_n - \theta)^2\ \ge\ 1/J^*(R)$.   (6.47)


2) There exists an $(R, \infty)$-achievable strongly asymptotically unbiased estimator $\hat\theta_n$ such that

  $\lim_{n\to\infty}\ \sup_{\theta \in \Theta}\ n\,E_\theta(\hat\theta_n - \theta)^2\ =\ 1/J^*(R)$.   (6.48)

Remark 6.10: Although Theorem 6.3 shows that the inverse of $J^*(R)$ gives the optimal minimax variance index, it is very hard to compute the values of $J^*(R)$ as a function of $R$. For this reason, Ahlswede and Burnashev [14] defined, instead of (6.45), another type of Fisher information, say, the "minimax" Fisher information, as in (6.49), and claimed the validity of the following important "local single-letterization" (cf. [14, Lemma 4]) stated in (6.50). This result was used by them to provide single-letterized upper bounds on $J^*(R)$. Unfortunately, however, the proof of (6.50) does not seem to be free of serious technical flaws (notice that $U_n$ in (6.45) must not depend on $\theta$!), and so we believe that $J^*(R)$ does not single-letterize in general.

VII. ZERO-RATE MULTITERMINAL PARAMETER ESTIMATION

In this section we describe the zero-rate estimator $\hat\theta_A$ of Amari [13]. The derivation of this estimator is along the same lines as the one shown in the latter part of Section VI to derive the estimator $\hat\theta_{HA}$. Although the arguments adopted in Section VI were indeed involved and subtle, the arguments to be shown below are much simpler, because the auxiliary random variables $U, V$ do not intervene here. We remark that, historically, the estimator $\hat\theta_A$ was first derived and was then generalized to the estimator $\hat\theta_{HA}$.

First, consider the rate constraints

  $(1/n) \log M_n \le |\mathcal X| \log(n+1)/n$   (7.1)
  $(1/n) \log L_n \le |\mathcal Y| \log(n+1)/n$   (7.2)

which mean the zero-rate data compression

  $\lim_{n\to\infty} (1/n) \log M_n = 0$   (7.3)
  $\lim_{n\to\infty} (1/n) \log L_n = 0$.   (7.4)

Under rate constraints (7.1) and (7.2), we can define encoders $f_n, g_n$ by

  $f_n(x^n) = \hat P_{x^n}$ (the type of $x^n$)   (7.5)
  $g_n(y^n) = \hat P_{y^n}$ (the type of $y^n$).   (7.6)

Also, the encoding scheme given by (7.5) and (7.6) is universal. Let the joint probability of the statistic $(\hat P_{X^n}, \hat P_{Y^n})$ be denoted by

  $p_\theta(\hat P_{X^n}, \hat P_{Y^n})$.   (7.7)

The decoder can then derive the maximum-likelihood equation and thereby construct the maximum-likelihood estimator by using the formula for the probability distribution (7.7). In order to describe this maximum-likelihood equation, set, as in Section VI,

  $\ell_\theta(x, y) = \log P_\theta(x, y)$   (7.8)

and define

  $\dot\ell_\theta(x, y) = \partial \ell_\theta(x, y)/\partial\theta$.   (7.9)

Furthermore, given the family (6.1) of joint distributions, we define the $m$-dimensional (row) vectors (7.10) and (7.11), associated with the marginals, and the statistic

  $(\hat P_{X^n},\ \hat P_{Y^n})$   (7.12)

where $P_\theta(x)$ and $P_\theta(y)$ denote the $\mathcal X$-marginal and the $\mathcal Y$-marginal of $P_\theta(x, y)$, respectively, and "$\cdot$" denotes the derivative with respect to the parameter $\theta$. Amari [13] has shown by a simple calculation that the maximum-likelihood equation can be written asymptotically as (7.13), where the positive-definite symmetric matrix $G_\theta$ appearing there has its elements defined as in (6.37), with the index set associated with (7.8) in place of that of (6.38); more specifically, its blocks are given by (7.14)–(7.16).

Then, denoting by $\hat\theta_A$ the estimator obtained by solving the maximum-likelihood equation (7.13), we have

Theorem 7.1 (Amari [13]): The zero-rate maximum-likelihood estimator $\hat\theta_A$ is strongly asymptotically unbiased, and for all $\theta \in \Theta$ it achieves the following covariance index:

  $v_A(\theta) = (G_\theta^*)^{-1}$   (7.17)

where $G_\theta^*$ is the matrix given by (7.18).

Remark 7.1: It is not difficult to check (see [13]) that the $m$-dimensional matrix $G_\theta^*$ in (7.18) is the Fisher information index of the statistic $(\hat P_{X^n}, \hat P_{Y^n})$ as defined in (7.12).


Remark 7.2: It is easy to check that the maximum-likelihood equation (7.13) coincides with that in (6.35) with the auxiliary random variables $U, V$ set to a constant variable, i.e., with the projection matrix set to the identity. Accordingly, the estimator $\hat\theta_{HA}$ reduces to the estimator $\hat\theta_A$. It should be noticed, however, that the process of deriving $\hat\theta_{HA}$ with positive rates was extremely complicated (cf. Han and Amari [15]), mainly because there the authors had to ingeniously elaborate a type-preserving universal coding scheme (along with the corresponding projection matrix) by introducing auxiliary random variables, whereas here in the zero-rate case no such elaboration is needed, as may be seen from (7.5) and (7.6).

Remark 7.3: If neither of the marginals $P_\theta(x)$ and $P_\theta(y)$ depends on $\theta$, then $v_A(\theta)$ is unbounded. This observation corresponds to Remark 5.1 for the zero-rate hypothesis testing.

In contrast with the estimator $\hat\theta_{HA}$, the zero-rate estimator $\hat\theta_A$ can be shown to be always "optimal," although this optimality problem had been left open for a long time. This converse counterpart of Theorem 7.1 is novel, and is stated as

Theorem 7.2: Suppose that the positivity condition

  $P_\theta(x, y) > 0$ for all $(x, y) \in \mathcal X \times \mathcal Y$   (7.19)

is satisfied. Then, any covariance index $v(\theta)$ achievable under zero-rate compression satisfies the inequality

  $v(\theta) \ge (G_\theta^*)^{-1}$   (7.20)

where $G_\theta^*$ is the Fisher information index as in (7.18), and the inequality "$\ge$" between matrices means that the difference is nonnegative-definite.

Proof: The proof, given later in Section XI, takes a large-deviation approach combined with the converse theorem for the zero-rate hypothesis testing.

Remark 7.4: Inequality (7.20) is a multiterminal version of the Cramér–Rao bound, and Theorem 7.1 tells us that this bound can actually be attained by the simple estimator $\hat\theta_A$. An immediate consequence of Theorem 7.2 is that the estimator $\hat\theta_A$ is "uniformly most powerful" under zero-rate data compression, because $\hat\theta_A$ attains the covariance index $(G_\theta^*)^{-1}$.

Remark 7.5: Theorems 7.1 and 7.2 for the zero-rate parameter estimation are in nice correspondence to Theorems 5.4 and 5.5 for the zero-rate hypothesis testing, respectively.

Example 7.1: Consider the zero-rate compression estimation for the binary source with $\mathcal X = \mathcal Y = \{0, 1\}$, where $P_\theta(x, y)$ is given by (7.21) and is positive for $\theta$ ranging over the interval specified in (7.22). The marginals are given by (7.23) and (7.24), where $P_\theta(x)$ and $P_\theta(y)$ denote the $\mathcal X$-marginal and the $\mathcal Y$-marginal of $P_\theta(x, y)$, respectively. Clearly, the $\mathcal Y$-marginal alone does not provide any information about $\theta$, because it is independent of $\theta$. Therefore, we may think of an estimator using only the $\mathcal X$-marginal type $\hat P_{X^n}$, which is unbiased with the variance given by (7.25).

On the other hand, the maximum-likelihood estimator $\hat\theta_A$ using both $\hat P_{X^n}$ and $\hat P_{Y^n}$ is constructed as follows. The covariance in this case is given by (7.26); hence the reduction (7.27) relative to (7.25), which is positive for all admissible $\theta$. The maximum-likelihood equation (7.13) turns out to be (7.28). By solving this equation, we have the estimator (7.29), which is nonlinear in the types $\hat P_{X^n}$ and $\hat P_{Y^n}$. The variance (7.30) of $\hat\theta_A$ is obviously smaller than the variance in (7.25); the difference (7.27) is the amount of statistical information contributed by the ancillary statistic $\hat P_{Y^n}$. The variance (7.30) is optimal under the zero-rate compression, owing to Theorem 7.2.

Remark 7.6: The same geometrical observation as in Remark 6.9 for the case with positive rates continues to be valid also here in the zero-rate case if we set $U, V$ to constants. In this case, the data space (6.42) reduces to

  $\mathcal D_n = \{(\hat P_{x^n},\ \hat P_{y^n})\}$   (7.31)

whereas (6.43) reduces to the double minimization

  $\hat\theta = \arg\min_\theta\ \min\{D(Q \| P_\theta) : Q_X = \hat P_{x^n},\ Q_Y = \hat P_{y^n}\}$   (7.32)

with $Q$ ranging over joint distributions on $\mathcal X \times \mathcal Y$ (cf. Remark 8.3).


VIII. INFORMATION-GEOMETRICAL CONSIDERATIONS

So far we have demonstrated that, in the multiterminal hypothesis testing, the pertinent problems of minimizing divergences under certain constraints played a key role in establishing single-letter characterizations for (lower bounds on) the optimal error exponents. Similarly, we have noticed in Remark 6.9 that also in the multiterminal parameter estimation the same kind of minimization problems intervene in establishing single-letter characterizations for (upper bounds on) the optimal covariance index. In this section, we give some information-geometrical interpretations of these minimization problems (for the general exposition, see Amari and Han [9]; for the general framework of information geometry in terms of differential geometry, see Amari [22] and Amari and Nagaoka [23]).

Let us suppose that we are given a joint probability distribution on the product alphabet, as in (8.1) and (8.2). The distribution is uniquely specified by giving its values for all symbol pairs (up to the normalization constraint). However, in the multiterminal systems it is sometimes more convenient to use other kinds of coordinate systems in order to specify these joint distributions.

We first define the eta-coordinates of the distribution by (8.3), where (8.4) and (8.5), together with (8.6), (8.7), and (8.8). Using these eta-coordinates we can uniquely specify the distribution as in (8.9) and (8.10). Moreover, for later use, define the joint distribution given by (8.11) and (8.12).

Let us next define the theta-coordinates of the distribution by (8.13), where (8.14), (8.15), and (8.16). Using these theta-coordinates we can uniquely specify the distribution as in (8.17), (8.18), (8.19), and (8.20).

We now consider decomposing each of these coordinate systems into two parts as in (8.21) and (8.22), where the first parts are given by (8.23) and (8.24) and the second parts by (8.25) and (8.26). With these decompositions, we can use the first part of the eta-coordinates together with the second part of the theta-coordinates to specify the distribution. We will call the resulting pair the mixed coordinates of the distribution.

A great advantage of the mixed coordinate system is the following theorem concerning the minimization of divergences under certain constraints. Denote the mixed coordinates of two joint distributions accordingly. Given a marginal (eta-) part, define the set of joint distributions sharing it by (8.27). Moreover, define the joint distribution combining this eta-part with the theta-part of the second distribution by (8.28).
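For orientation, the three coordinate systems are easy to write down in the simplest binary case; the sketch below is an editorial illustration, not the paper's notation. The eta-coordinates are the moments (P(X=1), P(Y=1), P(X=1,Y=1)), the theta-coordinates are the log-linear coefficients of log p(x,y), and the mixed coordinates keep the marginal part of eta together with the interaction part of theta.

```python
# Coordinate systems for a 2x2 joint distribution (illustration only; the
# paper's general definitions (8.3)-(8.26) reduce to this in the binary case).
import numpy as np

def eta_coords(p):
    """p[x, y] = P(X=x, Y=y); returns (P(X=1), P(Y=1), P(X=1,Y=1))."""
    return p[1, 0] + p[1, 1], p[0, 1] + p[1, 1], p[1, 1]

def theta_coords(p):
    """Coefficients in log p(x,y) = th1*x + th2*y + th3*x*y + log p(0,0)."""
    th1 = np.log(p[1, 0] / p[0, 0])
    th2 = np.log(p[0, 1] / p[0, 0])
    th3 = np.log(p[1, 1] * p[0, 0] / (p[1, 0] * p[0, 1]))  # log odds ratio
    return th1, th2, th3

p = np.array([[0.3, 0.2],
              [0.1, 0.4]])
eta1, eta2, eta3 = eta_coords(p)
th1, th2, th3 = theta_coords(p)
# Mixed coordinates: marginal part from eta, interaction part from theta.
print("mixed coordinates:", (eta1, eta2, th3))
```

Either triple determines p uniquely; the mixed triple does as well, which is the fact exploited in the theorem below.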


Fig. 7. Pythagorean theorem: geometrical interpretation for the zero-rate hypothesis testing.

Then we have the following theorem.

Theorem 8.1 (Pythagorean Theorem: Amari and Han [9]): (8.29), where the mixed coordinates of the minimizing distribution in (8.28) combine the prescribed marginal (eta-) part with the interaction (theta-) part of the reference distribution.

Remark 8.1: If we set the marginal part appropriately, then the set in (8.27) is actually the same as the one defined by (5.5) in Section V, and the minimization (8.28) is the same as that on the right-hand side of (5.6) in Theorem 5.1 for the zero-rate multiterminal hypothesis testing. From the information-geometrical point of view, the set (8.27) consists of the probability distributions having the same marginals, as specified by its defining eta-part. This set is m-flat in the sense that linear mixtures of its members again belong to it; we may call it the m-fiber. The set of all probability distributions on the product alphabet decomposes into the union of all m-fibers. This is a foliation, and product distributions may be taken as representatives of the fibers. On the other hand, given a distribution, we can construct, by using (8.30), a set which contains it and is e-flat, in the sense that it is closed under the operation of connecting its elements by an exponential family; we may call it the e-fiber. The union of all e-fibers defines another foliation. The two foliations are orthogonal to each other in the sense that an m-fiber and an e-fiber are mutually orthogonal with the Fisher information matrix as the metric (Amari [13]; see Fig. 7).
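The following numerical check, an editorial illustration for the binary case under the reading of Theorem 8.1 just given, verifies the Pythagorean decomposition: the divergence minimizer over the m-fiber carrying the marginals of P keeps the theta-interaction (odds ratio) of Q, and D(P||Q) = D(P||P*) + D(P*||Q) holds exactly.

```python
# Numerical check of the Pythagorean relation (8.29) in the 2x2 case
# (illustration only). P* matches the marginals of P and the odds ratio of Q.
import numpy as np
from scipy.optimize import brentq

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def table(a, b, p11):
    """2x2 joint with P(X=1)=a, P(Y=1)=b, P(X=1,Y=1)=p11."""
    return np.array([[1 - a - b + p11, b - p11],
                     [a - p11, p11]])

def project_onto_fiber(a, b, q):
    """Member of the fiber {marginals=(a,b)} sharing the odds ratio of q."""
    psi = q[1, 1] * q[0, 0] / (q[1, 0] * q[0, 1])
    g = lambda t: t * (1 - a - b + t) - psi * (a - t) * (b - t)
    lo, hi = max(0.0, a + b - 1.0), min(a, b)
    return table(a, b, brentq(g, lo + 1e-12, hi - 1e-12))

P = table(0.6, 0.5, 0.40)
Q = table(0.3, 0.7, 0.25)
P_star = project_onto_fiber(0.6, 0.5, Q)  # fiber defined by P's marginals
print(kl(P, Q))                           # left-hand side
print(kl(P, P_star) + kl(P_star, Q))      # right-hand side: identical
```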

Let us now consider the differential counterpart of Theorem 8.1. Let the row vectors represent the differentials of the eta-coordinates and the theta-coordinates, respectively, and denote the differential of the joint distribution accordingly. Then, by direct calculation, we can easily check the validity of the following lemma.

Lemma 8.1 (Amari [13], Han and Amari [15]): (8.31), (8.32), and (8.33), where the matrix appearing here is the positive-definite symmetric matrix specified by (6.37), with the index set (8.34) in place of the index set of (6.38).

It is also possible to describe Theorem 8.1 in terms of the mixed coordinates. To this end, partition the matrix as in (8.35); then a simple direct calculation using (8.31) and (8.32) immediately yields the following lemma, which demonstrates that the divergence differential can actually be decomposed into two parts: one is the quadratic form of the marginal (eta-) differentials, and the other is the quadratic form of the interaction (theta-) differentials.

Lemma 8.2 (Differential Pythagorean Theorem: Amari [13]): (8.36)

Remark 8.2: The two quantities on the right-hand side of (8.36) correspond to the two divergences on the right-hand side of (8.29), respectively.


Fig. 8. Double minimization: geometrical interpretation for the zero-rate estimator.

The positive-definite submatrix on the right-hand side of (8.36) is the same as that defined in Section VII, of the corresponding dimension. Also, some linear algebra shows that the matrix (8.37) on the right-hand side of (8.36) is the inverse of the submatrix (8.38) of the full matrix, and hence it is also positive-definite.

Now, think of the case where the two distributions are very close to one another, as in (8.39) and (8.40). Then, by means of Theorem 8.1, the mixed coordinates of the minimizer defined by (8.28) take the form (8.41), which together with (8.28) and Lemma 8.2 (with the present quantities in place of the original ones) yields the following.

Theorem 8.2: Let the relevant set and distribution be defined by (8.27) and (8.40), respectively. Then (8.42) holds.

Remark 8.3: Theorem 8.2 (the differential version for the minimization of divergences) provides a very powerful tool in studying the multiterminal parameter estimation problems. For example, application of Theorem 8.2 to the minimization (7.32) in Remark 7.6 immediately yields the single-minimization equation (8.43), which is nothing but the maximum-likelihood equation (7.13) already derived for the multiterminal zero-rate estimation (Fig. 8). This can be easily checked by differentiating the quadratic form on the right-hand side of (8.43) with respect to the parameter. Also, later in Step 5 of Section XI, we effectively invoke Theorem 8.2 in order to prove Theorem 7.2 (the converse part) for the zero-rate estimation problem. Similarly, it is technically very complicated but logically not difficult in principle to generalize the differential version in Theorem 8.2 so that its application to the minimization (6.43) in Remark 6.9 yields the maximum-likelihood equation (6.35) for the multiterminal estimation with positive rates.

IX. ZERO-RATE MULTITERMINAL PATTERN CLASSIFICATION

Let us consider a finite number of pattern classes which generate i.i.d. data subject to their respective joint probability distributions on the common product alphabet. We denote by the data sequences independent replicas of the generic pair of variables. The problem of statistical pattern recognition is stated as follows: given i.i.d. data generated by one of the classes, decide the true class which generated the data. In the multiterminal case, the statistician cannot have access to the original full data but uses the encoded messages, so that the decoder


maps the encoded messages to a pattern class. The set (9.1) is called the acceptance region of the corresponding pattern.

In order to evaluate the performance of the above system, we use the error probability in the Bayesian framework. Let each pattern class have a prior probability, and let the probability that one pattern is classified to another by the system be written as in (9.2). Hence, the average error probability is given by (9.3). Let the minimum of (9.3) be taken over all possible encoders and decoders under constraints (4.2) and (4.3). It is known that in general this minimum decreases to zero exponentially with the blocklength, so that we put (9.4). This is called the optimal error exponent for the multiterminal pattern classification.

In the case of the single-user system, or in the case of the multiuser system with full rates, the statistician at site C can use the full data. The problem is then easily solved, where the optimal acceptance regions are given (cf. Kanaya and Han [24]) by (9.5). On the other hand, needless to say, the problem is generally nontrivial in the multiterminal case.

One may say that the classification problem sits between hypothesis testing and parameter estimation. When the number of pattern classes is infinite and patterns are labeled by a continuous parameter, it reduces to the estimation problem. Here, one of the difficulties lies in our being forced to use only universal coding schemes, because the encoders do not know which class the data belong to. However, the problem is similar to hypothesis testing when one treats only a finite number of distributions. The difference is that here all the pattern classes have equal standpoints, whereas in hypothesis testing the null hypothesis plays a special role different from the alternative.

The pattern classification is an important problem in its own right. Results from hypothesis testing and estimation may be useful for obtaining bounds on the error exponent of pattern classification. However, there have appeared no papers treating this important problem except for a preliminary study by Amari and Han [9], where the zero-rate case with two pattern classes is treated from the information-geometrical point of view. The present paper generalizes their results to the case of an arbitrary finite number of pattern classes, and also gives the proof of the converse theorem for the first time.

In order to state the new results, we use the following concepts. In the set of all probability distributions on the product alphabet, the subset (9.6) is the fiber (cf. (8.27)) indexed by the coordinates which specify the marginal distributions defined by (8.6) and (8.7). Given a fixed distribution, the divergence from the fiber to it is defined by (9.7). For two distributions, their equi-separator is defined as the set consisting of those fibers whose distances to the one and to the other are equal, as in (9.8). The separating divergence of the two distributions is defined by (9.9). We further define (9.10). The acceptance region of one pattern against another is defined by (9.11), and the overall acceptance region of a pattern by (9.12).

Then, we have the following generalization of Amari and Han [9]:

Theorem 9.1: The zero-rate optimal error exponent of pattern classification is lower-bounded as in (9.13).

Proof: Let the encoding functions map the two data sequences to their respective types. As in (2.1) of Section II, we set the corresponding message sizes, so that the rates vanish asymptotically. Based on these messages, we can construct the fiber specified by the two marginal type distributions. Design the decoder by (9.14); that is, the decoded class is the one whose distribution is closest to this fiber in the sense of the divergence. (When there are two or more equally minimal candidates, assign any one of them at random.) The error probability for each class is then written as in (9.15).

Based on these messages, we can construct the -fiber specified by the two marginal type distributions Design the decoder by where (9.14) when is the closest from in That is, the sense of the divergence. (When there are two or more equal minimal candidates, assign to any one of them at random.) The for class is then written as error probability (9.15)


Then, by virtue of the Large Deviation Theorem, it follows that (9.16). By (9.3) and (9.10), we thus have (9.17). This proves the achievability of (9.13).

Remark 9.1: The error exponent does not depend on the prior distribution, because of (9.18), where the second term on the right-hand side of (9.18) vanishes as the blocklength tends to infinity.

Remark 9.2: The acceptance region defined in (9.12) is "optimal," provided that we are restricted to using the marginal-type messages only.

The converse counterpart of Theorem 9.1 is novel, and is stated as follows.

Theorem 9.2: Assume that the positivity condition holds for all the class distributions. Then, for all classes, (9.19).

Proof: Let the two distributions specified by a given pair of classes be considered, and let us consider the hypothesis testing of the one against the other. In order to invoke the contradiction argument, assume that there exists a zero-rate coding system for which the error exponent of classification is larger than the right-hand side of (9.19). We use the decision rule which accepts the null hypothesis when the output of the decoder is the corresponding class and rejects it otherwise. Then, in view of (9.2), (9.3), and Remark 9.1, the error probabilities of the first kind and of the second kind for this hypothesis testing obviously satisfy (9.20) and (9.21). On the other hand, by the definition of the separating divergence, there exists a distribution such that (9.22), from which it follows that (9.23), where the exponent is defined by (5.14) and (5.15) with the present pair of distributions in place of the original one. Moreover, by means of (9.20) and Theorem 5.5 with this pair in place of the original one, we have (9.24), where we have used (9.22) and (9.23). This obviously contradicts (9.21), thus proving Theorem 9.2.

A consequence of Theorems 9.1 and 9.2 is:

Corollary 9.1: Under the same positivity assumption, for all classes, (9.25).

Remark 9.3: We can generalize this result to the case where the cost of misclassifying one class to another is specified, the target then being to minimize the average cost. We have discussed only the zero-rate case. It does not seem to be so difficult to obtain some bounds in the positive-rate case by combining the results of hypothesis testing with the universal coding scheme for parameter estimation.
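To make the zero-rate decoder of Theorem 9.1 concrete, here is an editorial sketch for binary alphabets and two invented classes. Each encoder transmits only its marginal type; the decoder assigns the class whose distribution is divergence-closest to the fiber of joint distributions consistent with the received types (one plausible reading of (9.7), in line with Sanov-type large deviation bounds). The class matrices and sample size are hypothetical.

```python
# Sketch of the zero-rate pattern classifier from the proof of Theorem 9.1
# (binary alphabets; the two class distributions below are hypothetical).
import numpy as np
from scipy.optimize import minimize_scalar

def kl(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def div_fiber_to(a, b, q):
    """min D(P'||q) over 2x2 joints P' with marginals (a, b)."""
    lo, hi = max(0.0, a + b - 1.0), min(a, b)
    def obj(p11):
        p = np.array([[1 - a - b + p11, b - p11], [a - p11, p11]])
        return kl(p, q)
    return minimize_scalar(obj, bounds=(lo + 1e-9, hi - 1e-9),
                           method="bounded").fun

def classify(x_block, y_block, classes):
    a = float(np.mean(x_block))   # marginal type sent by encoder A
    b = float(np.mean(y_block))   # marginal type sent by encoder B
    return int(np.argmin([div_fiber_to(a, b, q) for q in classes]))

classes = [np.array([[0.5, 0.1], [0.1, 0.3]]),   # hypothetical class 1
           np.array([[0.1, 0.3], [0.3, 0.3]])]   # hypothetical class 2
rng = np.random.default_rng(0)
idx = rng.choice(4, size=1000, p=classes[0].ravel())  # sample from class 1
x, y = idx // 2, idx % 2
print("decided class:", classify(x, y, classes))      # expect 0
```

Note that classes sharing both marginals cannot be separated at zero rate, which is consistent with the role of the separating divergence (9.9): it then vanishes.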

X. PROOF OF THEOREM 5.5

In this section, we give the proof of Theorem 5.5 stated in Section V. In order to prove it, we need the following lemma.

Lemma 10.1 (Han and Kobayashi [5]): Consider an i.i.d. random sequence of length n subject to a probability distribution on a finite alphabet, and let any subset of sequences be such that (10.1) holds, the lower bound being a constant. If we fix any joint type on the alphabet and put (10.2) and (10.3), then we have (10.4).

Turning to the proof of the theorem, let an arbitrary acceptance region for the hypothesis testing (4.11) and (4.12) be such that (10.5), where (10.6). Equations (10.5) and (10.6) imply that (10.7), where the slack term is an arbitrarily small constant.
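Although Lemma 10.1's exact statement is abbreviated above, bounds of the form (10.4) typically rest on the standard type-counting estimates of [19]. The following chain, written in generic notation that is an editorial assumption rather than the paper's, records the mechanism.

```latex
% Standard type-counting bounds (cf. [19]); generic notation, not the paper's.
\begin{align*}
  \#\{\text{joint types of length } n\}
      &\le (n+1)^{|\mathcal{X}||\mathcal{Y}|},\\
  P^n\bigl(T_{\widetilde P}\bigr)
      &\ge (n+1)^{-|\mathcal{X}||\mathcal{Y}|}
           \exp\bigl(-n\,D(\widetilde P\,\|\,P)\bigr)
      \quad \text{for every joint type } \widetilde P .
\end{align*}
```

Hence any set carrying probability at least a constant must meet some type class in probability at least that constant divided by the polynomial number of types, which converts a constant-probability hypothesis such as (10.1) into an exponential bound of the shape (10.4).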


Next, select an arbitrary "internal point" of the set specified in (5.14). Then, clearly (10.8). Define (10.9), where the added constant is arbitrary. Then, in view of (10.8) and the uniform continuity of the divergence, for all sufficiently large n it holds that (10.10), provided that we take the constant sufficiently small. Consequently, (10.4) in Lemma 10.1 yields (10.11) for all sufficiently large n.

Now we define the set (10.12) and consider an i.i.d. random sequence of length n subject to the corresponding probability distribution. Then, by means of (10.11), we have (10.13), where in the last step we have used the fact (cf. [19]) that (10.14) holds under the zero-rate constraint.

Now consider the zero-rate hypothesis testing with

(null hypothesis) (10.15)

(alternative hypothesis) (10.16)

and the same acceptance region as above. Then, by virtue of (10.13), for this hypothesis testing the error probability of the first kind satisfies the "constant-type" constraint (10.17), which together with Theorem 5.3 yields (10.18), where (10.19) and (10.20).

On the other hand, we notice that the internal point was arbitrary as far as condition (10.8) is satisfied. Therefore, in the light of the definition (5.15), we see that the infimum of the right-hand side in (10.18) over all possible internal points satisfying (10.8) coincides with (10.21). Thus (10.18) reduces to (10.22), that is, (10.23), which is what was to be proven.

XI. PROOF OF THEOREM 7.2

In this section we give the proof of Theorem 7.2 stated in Section VII in several steps.

Step 1: We start with providing a lemma that plays a crucial role later in Step 5. We consider the general multiterminal hypothesis testing with (4.11) and (4.12), that is,

(null hypothesis) (11.1)

(alternative hypothesis) (11.2)

where the alternative is arbitrary while (11.3) holds. Notice that the error probabilities of the first kind and of the second kind, as specified by (1.7) and (1.8), can be written as (11.4) and (11.5), where the acceptance region is as specified by (4.1). The following lemma is a more precise restatement of Theorem 5.3 in Section V.


Lemma 11.1 (Shalaby and Papamarcou [6]): With the hypothesis testing (11.1) and (11.2), let any constant be fixed and suppose that the positivity condition (5.9) holds. If the constraint (11.6) is satisfied with zero-rate encoders, then there exists a sequence such that (11.7), (11.8), (11.9), and (11.10), where we have put the set appearing here as the one defined in (5.5).

Remark 11.1: Marton [8] has shown an extended version of the Blowing-up Lemma [7] used to establish Lemma 11.1, which tells us that the sequence in Lemma 11.1 is independent of the hypotheses, and hence the same sequence remains valid even when the hypotheses vary with the blocklength. This fact will be used later in Step 5.

Step 2: Let us have a family of joint distributions (11.11) which is continuously twice differentiable with respect to the parameter and satisfies the positivity condition (7.19). Consider any asymptotically unbiased achievable estimator for the parameter whose covariance index (a square matrix of the parameter dimension) is given by (11.12). In order to prove the theorem by the contradiction argument, suppose that (11.13) does not hold, where the Fisher information index is specified by (7.18). Then, there exist some parameter value and a vector of the same dimension such that (11.14). Here, we can normalize the vector so as to satisfy the condition (11.15). If we set the corresponding quantities accordingly, then (11.15) yields (11.16), so that (11.14) is written as (11.17).

Let us now consider the one-dimensional parameterization of the family such that (11.18); then we have (11.19) because of the normalization above. We define the estimator of the scalar parameter by (11.20), which is easily seen to be asymptotically unbiased since the original estimator is asymptotically unbiased. Let the variance index of this estimator at the given parameter value be considered. Then, this variance index coincides with the left-hand side of (11.17), i.e., (11.21), so that (11.17) is rewritten as (11.22).

Step 3: With any sufficiently large positive integers, set the total blocklength to be their product, and suppose that data of that length were generated at sites A and B, respectively. We divide these data into subblocks as in (11.23) and (11.24), and let the estimator using the data of each subblock be the estimator of the parameter defined in (11.20). Since we are considering i.i.d. data, the subblocks are jointly independent from one another, and hence these estimators are also mutually independent. Define another estimator from these subblock estimators as in (11.25); then its variance at the given parameter value is given by (11.26). On the other hand, since we have assumed that the original estimator is asymptotically unbiased, there exists a sequence (of the appropriate dimension) such that for all parameter values (11.27) and (11.28) hold.
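The point of the subblock construction can be seen in a small simulation; this is an editorial illustration in which a Bernoulli sample mean stands in for the generic block estimator (11.20). Averaging the independent per-block estimates shrinks the variance by the number of blocks, which is the behavior exploited in (11.25) and (11.26).

```python
# Variance reduction by subblock averaging (illustration of Step 3 only;
# the Bernoulli estimator stands in for the generic block estimator (11.20)).
import numpy as np

rng = np.random.default_rng(1)
theta, n, m, trials = 0.3, 50, 20, 2000   # block length n, m blocks

one_block, averaged = [], []
for _ in range(trials):
    data = rng.random((m, n)) < theta     # m independent blocks of length n
    block_est = data.mean(axis=1)         # one estimate per subblock
    one_block.append(block_est[0])
    averaged.append(block_est.mean())     # averaged estimator, cf. (11.25)

print("single-block variance:", np.var(one_block))   # ~ theta(1-theta)/n
print("averaged variance    :", np.var(averaged))    # ~ theta(1-theta)/(m*n)
```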


We now consider a sequence satisfying (11.29) and (11.30), with the sequence appearing there as specified in Lemma 11.1 in Step 1. Then, since it follows from (11.19), (11.20), (11.25), and (11.27) that (11.31) and (11.32) hold, we have (11.33), where in the last step we have taken account of (11.29). Hence (11.34).

Step 4: Let the left-hand side of (11.30) be denoted as in (11.35). Then (11.30) is equivalent to (11.36), from which, together with (11.35), it follows that (11.37).

Let us now consider, in the hypothesis testing with (11.1) and (11.2), the following case using the given family (11.11) of distributions:

(null hypothesis) (11.38)

(alternative hypothesis) (11.39)

We choose one of the two hypotheses according to the value of the statistic as follows: the null hypothesis is adopted if the statistic does not exceed the threshold in (11.40), and the alternative hypothesis is adopted otherwise, where (11.40) holds with the quantities introduced above. We need here the following lemma.

Lemma 11.2 (Billingsley [21]): Let (11.41), where the summands are finite-valued, independent, and identically distributed random variables with common mean and variance. If a sequence satisfies the condition (11.42), then (11.43) and (11.44), where the limiting values are expressed through the Gaussian distribution function.
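Although the precise statement (11.41)-(11.44) is not reproduced above, the kind of evaluation Lemma 11.2 delivers is the central-limit tail asymptotics for sums of bounded i.i.d. variables, which the following quick Monte Carlo check illustrates (all numerical parameters are invented).

```python
# CLT-type tail evaluation in the spirit of Lemma 11.2 (illustration only):
# P{ S_n >= n*mu + c*sqrt(n) } -> 1 - Phi(c/sigma) for bounded i.i.d. summands.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n, trials, c = 400, 20000, 1.0
mu, sigma = 0.5, 0.5                              # Bernoulli(1/2) summands
s = rng.integers(0, 2, size=(trials, n)).sum(axis=1)
empirical = np.mean(s >= n * mu + c * np.sqrt(n))
print(empirical, 1 - norm.cdf(c / sigma))         # the two values are close
```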

Let us consider in Lemma 11.2 the case (cf. (11.25)) where (11.45) and (11.46) hold.

Case a): In this case we have (11.47). Put (11.48); then, from (11.34), we have (11.49). In view of (11.37) and (11.49), it is easy to check that the choice (11.48) satisfies condition (11.42). Then, taking account of (11.40), (11.48), and (11.49), application of (11.43) in Lemma 11.2 gives the evaluation of the error probability of the first kind as (11.50). Moreover, it follows from (11.37) that (11.51). On the other hand, setting the corresponding quantity accordingly, (11.51) and (11.50) together conclude that (11.52).

Case b): In this case, by means of (11.29) with the alternative parameter value in place of the original one, we have (11.53).


Put (11.54); then, from (11.34), we have (11.55). In view of (11.29), (11.30) with the alternative parameter value in place of the original one, and (11.55), it is easy to check that the choice (11.54) satisfies condition (11.42). Then, taking account of (11.40), (11.54), and (11.55), application of (11.44) in Lemma 11.2 gives the evaluation of the error probability of the second kind as (11.56). Let us here also set the corresponding exponent; then, in view of (11.36) with the alternative parameter value in place of the original one, the bound (11.56) can be written as (11.57).

Summarizing, it is concluded that the zero-rate hypothesis testing with (11.38) and (11.39) has the error probabilities specified by (11.52) and (11.57), respectively.

Step 5: We now think of an application of Lemma 11.1 in Step 1 to the hypothesis testing with (11.38) and (11.39) (cf. Remark 11.1), using the same decision rule as described in Step 4. This is to derive a contradiction with the fact shown in Step 4. We first observe that, in the light of (11.29) and (11.34), the hypotheses (11.38) and (11.39) asymptotically coincide with one another as the blocklength tends to infinity. It is also easy to check that the set defined by (8.27) in Theorem 8.2 coincides with the set defined by (11.10) in Lemma 11.1 if we identify the corresponding distributions in Theorem 8.2, so that the left-hand side of (8.42) coincides with the set in (11.9). As a consequence, by virtue of Theorem 8.2, we can write (11.58), where, denoting the mixed coordinates (cf. Section VIII) of the two hypotheses accordingly, we have set (11.59).

On the other hand, with the notation in Section VII (cf. (7.10)), we see that (11.60), which together with (11.18), (11.34), and (11.59) yields (11.61), where the prime denotes the derivative with respect to the parameter. Then, substitution of (11.61) into (11.58) gives (11.62), where we have taken account of the definition (7.18) of the Fisher information index. Since, by (11.52), the error probability of the first kind is suitably bounded, the assumption (11.6) in Lemma 11.1 (with the present hypotheses in place of the original ones) is satisfied. Hence, by virtue of Lemma 11.1, the inequality (11.63) must hold. Substitution of (11.57) and (11.62) into (11.63) leads us to (11.64).

Taking the logarithm of both sides in (11.64) and dividing them by the blocklength, we obtain (11.65). On the other hand, it follows from (11.30), with the alternative parameter value in place of the original one, that (11.66). Thus, noting (11.66) and letting the blocklength tend to infinity in (11.65), it is concluded that (11.67), which obviously contradicts (11.22), thereby proving Theorem 7.2.

REFERENCES

[1] T. Berger, "Decentralized estimation and decision theory," presented at the IEEE 7th Spring Workshop on Information Theory, Mt. Kisco, NY, Sept. 1979.
[2] J. Rissanen, "Universal coding, information, prediction and estimation," IEEE Trans. Inform. Theory, vol. IT-30, pp. 629-636, 1984.
[3] R. Ahlswede and I. Csiszár, "Hypothesis testing with communication constraints," IEEE Trans. Inform. Theory, vol. IT-32, pp. 533-542, July 1986.
[4] T. S. Han, "Hypothesis testing with multiterminal data compression," IEEE Trans. Inform. Theory, vol. IT-33, pp. 759-772, Nov. 1987.


[5] T. S. Han and K. Kobayashi, "Exponential-type error probabilities for multiterminal hypothesis testing," IEEE Trans. Inform. Theory, vol. 35, pp. 2-14, Jan. 1989.
[6] H. M. H. Shalaby and A. Papamarcou, "Multiterminal detection with zero-rate data compression," IEEE Trans. Inform. Theory, vol. 38, pp. 254-267, Mar. 1992.
[7] R. Ahlswede, P. Gács, and J. Körner, "Bounds on conditional probabilities with applications in multi-user communication," Z. für Wahrscheinlichkeitstheorie und verwandte Gebiete, vol. 34, pp. 157-177, 1976; correction, ibid., vol. 39, pp. 353-354, 1977.
[8] K. Marton, "A simple proof of the blowing-up lemma," IEEE Trans. Inform. Theory, vol. IT-32, pp. 445-446, May 1986.
[9] S. Amari and T. S. Han, "Statistical inference under multiterminal rate restrictions: A differential geometric approach," IEEE Trans. Inform. Theory, vol. 35, pp. 217-227, Mar. 1989.
[10] H. M. H. Shalaby and A. Papamarcou, "Error exponent for distributed detection of Markov sources," IEEE Trans. Inform. Theory, vol. 40, pp. 397-408, Mar. 1994.
[11] T. S. Han, H. Shimokawa, and S. Amari, "Error bound of hypothesis testing with data compression," in Proc. IEEE Int. Symp. Information Theory (Trondheim, Norway, 1994), p. 29.
[12] Z. Zhang and T. Berger, "Estimation via compressed information," IEEE Trans. Inform. Theory, vol. 34, pp. 198-211, Mar. 1988.
[13] S. Amari, "Fisher information under restriction of Shannon information in multi-terminal situations," Ann. Inst. Statist. Math., vol. 41, no. 4, pp. 623-648, 1989.
[14] R. Ahlswede and M. Burnashev, "On minimax estimation in the presence of side information about remote data," Ann. Statist., vol. 18, no. 1, pp. 141-171, 1990.
[15] T. S. Han and S. Amari, "Parameter estimation with multiterminal data compression," IEEE Trans. Inform. Theory, vol. 41, pp. 1802-1833, Nov. 1995.
[16] C. R. Rao, Linear Statistical Inference and Its Applications, 2nd ed. New York: Wiley, 1973.
[17] J. Körner and K. Marton, "How to encode the modulo-2 sum of two binary sources," IEEE Trans. Inform. Theory, vol. IT-25, pp. 219-221, 1979.
[18] T. S. Han and K. Kobayashi, "A dichotomy of functions F(X, Y) of correlated sources (X, Y) from the viewpoint of the achievable rate region," IEEE Trans. Inform. Theory, vol. IT-33, pp. 69-76, Jan. 1987.
[19] I. Csiszár and J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems. New York: Academic, 1981.
[20] T. M. Cover and J. Thomas, Elements of Information Theory. New York: Wiley, 1991.
[21] P. Billingsley, Probability and Measure, 3rd ed. New York: Wiley, 1995.
[22] S. Amari, Differential-Geometrical Methods in Statistics (Lecture Notes in Statistics, no. 28). Berlin-Heidelberg: Springer, 1985.
[23] S. Amari and H. Nagaoka, Introduction to Information Geometry (in Japanese). Tokyo: Iwanami, 1995; English translation to be published by Amer. Math. Soc. and Oxford Univ. Press, 1998.
[24] F. Kanaya and T. S. Han, "The asymptotics of posterior entropy and error probability for Bayesian estimation," IEEE Trans. Inform. Theory, vol. 41, pp. 1988-1995, Nov. 1995.

[25] R. A. Wiggins and E. A. Robinson, "Recursive solution to the multichannel filtering problem," J. Geophys. Res., vol. 70, no. 8, pp. 1885-1891, 1965.
[26] R. R. Tenney and N. R. Sandell, Jr., "Detection with distributed sensors," IEEE Trans. Aerosp. Electron. Syst., vol. AES-17, pp. 501-510, 1981.
[27] S. D. Stearns, "Optimal detection using multiple sensors," in Proc. 1982 Carnahan Conf. Crime Countermeasures, 1982.
[28] L. K. Ekchian and R. R. Tenney, "Detection networks," in Proc. 21st IEEE Conf. Decision and Control (Orlando, FL, 1982), pp. 686-691.
[29] H. J. Kushner and A. Pacut, "A simulation study of a decentralized detection problem," IEEE Trans. Automat. Contr., vol. AC-27, pp. 1116-1119, 1982.
[30] D. Teneketzis, "The decentralized Wald problem," in IEEE 1982 Int. Large-Scale Systems Symp. (Virginia Beach, 1982), pp. 423-430.
[31] D. Teneketzis and P. Varaiya, "The decentralized quickest detection problem," IEEE Trans. Automat. Contr., vol. AC-29, no. 7, pp. 641-644, 1984.
[32] J. Tsitsiklis and M. Athans, "On the complexity of distributed decision problems," IEEE Trans. Automat. Contr., vol. AC-30, no. 5, pp. 440-446, 1985.
[33] Z. Chair and P. K. Varshney, "Optimal data fusion in multiple sensor detection systems," IEEE Trans. Aerosp. Electron. Syst., vol. AES-22, pp. 98-101, 1986.
[34] R. Srinivasan, "Distributed radar detection theory," Proc. Inst. Elec. Eng., vol. 133, pt. F, no. 1, pp. 55-60, 1986.
[35] F. A. Sadjadi, "Hypothesis testing in a distributed environment," IEEE Trans. Aerosp. Electron. Syst., vol. AES-22, pp. 134-137, 1986.
[36] J. J. Chao and C. C. Lee, "A distributed detection scheme based on soft local decisions," in Proc. 24th Annu. Allerton Conf. Communication, Control and Computing, 1986, pp. 974-983.
[37] S. C. A. Thomopoulos, R. Viswanathan, and D. P. Bougoulias, "Optimal decision fusion in multiple sensor systems," IEEE Trans. Aerosp. Electron. Syst., vol. AES-23, pp. 644-653, 1987.
[38] V. Aalo, R. Viswanathan, and S. C. A. Thomopoulos, "A study of distributed detection with correlated sensor noise," presented at GLOBECOM '87, 1987.
[39] T. J. Flynn and R. M. Gray, "Encoding of correlated observations," IEEE Trans. Inform. Theory, vol. IT-33, pp. 773-787, Nov. 1987.
[40] B. Picinbono and P. Duvaut, "Optimal quantization for detection," IEEE Trans. Commun., vol. 36, pp. 1254-1258, Nov. 1988.
[41] M. M. Al-Ibrahim and P. K. Varshney, "Non-parametric sequential detection for multisensor data," in Proc. 1989 Johns Hopkins Conf. Information Sciences and Systems, 1989, pp. 157-162.
[42] N. Sayiner and R. Viswanathan, "Distributed detection in a jamming environment," IEEE Trans. Aerosp. Electron. Syst., 1990.
[43] D. Kazakos, V. Vannicola, and M. C. Wicks, "Signal detection," in Proc. 1989 Johns Hopkins Conf. Information Sciences and Systems, 1989, pp. 180-185.
[44] H. R. Hashemi and I. B. Rhodes, "Decentralized sequential detection," IEEE Trans. Inform. Theory, vol. IT-35, pp. 509-520, May 1989.
[45] I. Y. Hoballah and P. K. Varshney, "An information theoretic approach to the distributed detection problem," IEEE Trans. Inform. Theory, vol. 35, pp. 988-994, 1989.
[46] I. Y. Hoballah and P. K. Varshney, "Distributed Bayesian signal detection," IEEE Trans. Inform. Theory, vol. 35, pp. 995-1000, 1989.
[47] T. S. Han and K. Kobayashi, "Multiterminal filtering for decentralized detection systems," IEICE Trans. Commun., vol. E75-B, no. 6, pp. 437-444, 1992.