1 On Measures of Information and Divergence and Model Selection Criteria
Alex Karagrigoriou, University of Cyprus, and Takis Papaioannou, University of Piraeus, Greece
Abstract: In this paper we discuss measures of information and divergence and model selection criteria. Three classes of measures, Fisher-type, divergence-type and entropy-type measures, are discussed and their properties are presented. Information under censoring and truncation is presented and model selection criteria are investigated, including the Akaike Information Criterion (AIC) and the Divergence Information Criterion (DIC).

Keywords and phrases: Measures of information and divergence, acid-test, additivity property, weighted distribution, censoring and truncation, model selection, AIC, DIC
1.1 Introduction
Measures of information appear everywhere in probability and statistics, and they also play a fundamental role in communication theory. They have a long history, going back to the papers of Fisher, Shannon, and Kullback. There are many measures, each claiming to capture the concept of information or simply being measures of (directed) divergence or distance between two probability distributions, and there exist many generalizations of these measures. One may mention here the papers of Lindley and Jaynes, who introduced entropy-based Bayesian information and the maximum entropy principle for determining probability models, respectively. (Part of this work was done while the second author was Visiting Professor at the University of Cyprus.)
Broadly speaking, there are three classes of measures of information and divergence: Fisher-type, divergence-type, and entropy-type (discrete and differential) measures. Some of them have been developed axiomatically (see, for example, Shannon's entropy and its generalizations), but most of them have been established operationally, in the sense that they have been introduced on the basis of their properties.

There have been several phases in the history of information theory. Initially we have (i) the development of generalizations of measures of information and divergence (f-divergence, (h, f)-divergence, hypo-entropy, etc.), (ii) the synthesis (collection) of properties they ought to satisfy, and (iii) attempts to unify them. All this work refers to populations and distributions. Later on we have the emergence of information or divergence statistics based on data or samples and their use in statistical inference, primarily in minimum "distance" estimation and in the development of asymptotic tests of goodness of fit or model selection criteria. Lately we have a resurgence of interest in measures of information and divergence, which are used in many places, in several contexts and in new sampling situations.

The measures of information and divergence enjoy several properties, such as non-negativity, maximal information and sufficiency, and statisticians do not agree on all of them. There is a body of knowledge known as statistical information theory which has made many advances but has not achieved wide acceptance and application. The approach is operational rather than axiomatic, in contrast with the axiomatic treatment of Shannon's entropy. There are several review papers which discuss the above points; we mention Kendall (1973), Csiszar (1977), Kapur (1984), Aczel (1986), Papaioannou (1985, 2001) and Soofi (1994, 2000).

The aim of this paper is to present recent developments on measures of information and divergence and on model selection criteria. In particular, in Section 2 we present a number of measures of information and divergence, while in Section 3 we discuss the most important properties of these measures. In Section 4 we review information and divergence under censoring, including measures associated with weighted distributions and truncated data. Finally, in Section 5 we cover issues related to model selection by discussing the well known AIC criterion (Akaike, 1973) and introducing the Divergence Information Criterion (DIC).
1.2 Classes of Measures
As mentioned earlier, there are three classes of measures of information and divergence: Fisher-type, divergence-type, and entropy-type measures. In what follows assume that f(x, θ) is a probability density function (pdf) corresponding to a random variable X and depending on a parameter θ. At other
places X will follow a distribution with pdf f1 or f2 .
1.2.1 Fisher-Type Measures
The Fisher’s measure of information introduced in 1925 is given by £ h i ¤ E ∂ ln f (X, θ) 2 = −E ∂ 22 ln f (X, θ) , θ univariate ∂θ ∂θ F ° h i° IX (θ) = ° ° ∂ ∂ θ k − variate °E ∂θi ln f (X, θ) ∂θj ln f (X, θ) ° , where || · || denotes the norm of a matrix. The above is the classical or expected information while the observed Fisher information where θˆ an estimate of θ, is given by 2 ˆ θ) − ∂ ln f (X, , 1 observation ∂θ2 F ˆ IX (θ) = ˆ 1 ,...,xn ) ∂ 2 ln f (θ|x − , n observations. ∂θ2 i2 h F = E ∂ ln f (X) or equivFinally the Fisher information number is given by IX ∂x alently by · 2 ¸ ¸ · ∂ ln f (X) ∂ ln f (X) 2 F JX (θ) = −E =E − [f 0 (b) − f 0 (a)], ∂x2 ∂x where a and b are the endpoints of the interval of support of X. Vajda (1973) extended the above definition by raising the score function to a power a, a ≥ 1 for the purpose of generalizing inference with loss function other than the squared one which leads to the variance and mean squared error criteria. The corresponding measure for a univariate parameter θ is given by ¯a ¯ ¯ ¯∂ V ¯ IX (θ) = E ¯ ln f (X, θ)¯¯ , a ≥ 1. ∂θ In the case of a vector-parameter θ, Ferentinos and Papaioannou (1981) F P (θ) any eigenvalue or special funcproposed as a measure of information IX tions of the eigenvalues of Fisher’s information matrix, such as the trace or its determinant. Finally Tukey (1965) and Chandrasekar and Balakrishnan (2002) discussed the following measure of information (∂µ/∂θ)2 , X univariate ∼ f (x, θ), θ scalar σ2 TB IX (θ) = (∂µ/∂θ)0 Σ−1 (∂µ/∂θ), X vector where µ and σ 2 (matrix Σ for the vector case) are the mean and the variance of the random variable X.
1.2.2 Measures of Divergence
A measure of divergence is used as a way to evaluate the distance (divergence) between any two populations or functions. Let f1 and f2 be two probability density functions which may or may not depend on an unknown parameter of fixed finite dimension. The best known measure of (directed) divergence is the Kullback-Leibler divergence, given by

$$I_X^{KL}(f_1,f_2) = \int f_1 \ln(f_1/f_2)\, d\mu$$

for a measure μ. If f1 is the density of X = (U, V) and f2 is the product of the marginal densities of U and V, then $I_X^{KL}$ is the well known mutual or relative information of coding theory.

The additive and non-additive directed divergences of order a were introduced in the 1960s and 1970s (Renyi, 1961; Csiszar, 1963; Rathie and Kannappan, 1972). The so-called order-a information measure of Renyi (1961) is given by

$$I_X^{R}(f_1,f_2) = \frac{1}{a-1}\ln\int f_1^{a}\, f_2^{1-a}\, d\mu, \quad a>0,\ a\neq 1.$$

For a tending to 1 the above measure becomes the Kullback-Leibler divergence. Another measure of divergence is the measure of Kagan (1963), given by

$$I_X^{Ka}(f_1,f_2) = \int \left(1 - f_1/f_2\right)^2 f_2\, d\mu.$$

Csiszar's measure of information (Csiszar, 1963) is a general divergence-type measure, known also as the φ-divergence, based on a convex function φ. Csiszar's measure is defined by

$$I_X^{C}(f_1,f_2) = \int \varphi(f_1/f_2)\, f_2\, d\mu,$$

where φ is a convex function on [0, ∞) such that 0φ(0/0) = 0, φ(u) → 0 as u → 0, and
$0\varphi(u/0) = u\varphi_\infty$, with $\varphi_\infty = \lim_{u\to\infty}[\varphi(u)/u]$. Observe that Csiszar's measure reduces to the Kullback-Leibler divergence if φ(u) = u ln u, while for φ(u) = (1 − u)² or φ(u) = sgn(a − 1)uᵃ, a > 0, a ≠ 1, it yields the Kagan (Pearson's X²) and the Renyi divergence, respectively. Another generalization is the family of power divergences introduced by Cressie and Read (1984), given by

$$I_X^{CR}(f_1,f_2) = \frac{1}{\lambda(\lambda+1)}\int f_1(z)\left[\left(\frac{f_1(z)}{f_2(z)}\right)^{\lambda} - 1\right] dz, \quad \lambda\in\mathbb{R},$$
where the cases λ = 0 and λ = −1 are defined by continuity. Note that the Kullback-Leibler divergence is obtained for λ tending to 0.

One of the most recently proposed measures of divergence is the BHHJ power divergence between f1 and f2 [Basu et al. (1998)], which is denoted by BHHJ, indexed by a positive parameter a, and defined as

$$I_X^{BHHJ}(f_1,f_2) = \int\left\{ f_2^{1+a}(z) - \left(1+\frac{1}{a}\right) f_1(z)\, f_2^{a}(z) + \frac{1}{a}\, f_1^{1+a}(z)\right\} dz, \quad a > 0.$$

Note that the above family, which is also referred to as a family of power divergences, is loosely related to the Cressie and Read power divergence. It should also be noted that the BHHJ family reduces to the Kullback-Leibler divergence for a tending to 0 and to the standard L2 distance between f1 and f2 for a = 1.

The above measures can also be defined in discrete settings. Let P = (p1, p2, . . . , pm) and Q = (q1, q2, . . . , qm) be two discrete finite probability distributions. Then the discrete version of Csiszar's measure is given by

$$I_X^{C}(P,Q) = \sum_{i=1}^{m} q_i\, \varphi\!\left(\frac{p_i}{q_i}\right),$$

while the Cressie and Read divergence is given by

$$I_X^{CR}(P,Q) = \frac{1}{\lambda(\lambda+1)}\sum_{i=1}^{m} p_i\left[\left(\frac{p_i}{q_i}\right)^{\lambda} - 1\right], \quad \lambda\in\mathbb{R}.$$
The discrete version of the BHHJ measure can be defined in a similar fashion. For a comprehensive discussion about statistical inference based on measures of divergence the reader is referred to Pardo (2006).
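As an illustration of the discrete formulas above (a sketch of ours, with arbitrary probability vectors), the following Python code evaluates the Kullback-Leibler, Renyi and Cressie-Read divergences and checks numerically that the latter two approach the Kullback-Leibler divergence as a → 1 and λ → 0, respectively.

```python
# Minimal sketch: discrete KL, Renyi (order a) and Cressie-Read (index lambda)
# divergences for two finite probability vectors p and q.
import numpy as np

def kullback_leibler(p, q):
    return float(np.sum(p * np.log(p / q)))

def renyi(p, q, a):
    return float(np.log(np.sum(p**a * q**(1 - a))) / (a - 1))

def cressie_read(p, q, lam):
    return float(np.sum(p * ((p / q)**lam - 1)) / (lam * (lam + 1)))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

print(kullback_leibler(p, q))
print(renyi(p, q, a=1.001))          # close to KL as a -> 1
print(cressie_read(p, q, lam=1e-4))  # close to KL as lambda -> 0
```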
1.2.3 Entropy-Type Measures
Let P = (p1, p2, . . . , pm) be a discrete finite probability distribution associated with a r.v. X. Shannon's entropy is defined by

$$H_X^{S} = -\sum p_i \ln p_i.$$

It was later generalized by Renyi (1961) as the entropy of order a:

$$H_X^{R} = \frac{1}{1-a}\ln\sum p_i^{a}, \quad a>0,\ a\neq 1.$$
A further generalization along the lines of Csiszar's measure based on a convex function φ, known as the φ-entropy, was proposed by Burbea and Rao (1982) and is given by

$$H_X^{\varphi} = -\sum \varphi(p_i).$$

Finally, it is worth mentioning the entropy measure of Havrda and Charvat (1967),

$$H_X^{C} = \frac{1-\sum p_i^{a}}{a-1}, \quad a>0,\ a\neq 1,$$
which for a = 2 becomes the Gini-Simpson index. Other entropy-type measures include the γ-entropy, given by

$$H_X^{\gamma} = \frac{1-\left(\sum p_i^{1/\gamma}\right)^{\gamma}}{1 - 2^{\gamma-1}}, \quad \gamma>0,\ \gamma\neq 1,$$

and the paired entropy, given by

$$H_X^{P} = -\sum p_i \ln p_i - \sum (1-p_i)\ln(1-p_i),$$

where pairing is in the sense of (p_i, 1 − p_i) [cf. Burbea and Rao (1982)].
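The following short Python sketch (an illustration with an arbitrary distribution, not taken from the paper) computes the entropy measures above and verifies the limiting relations mentioned in the text.

```python
# Minimal sketch: Shannon, Renyi and Havrda-Charvat entropies of a finite
# distribution; the latter two tend to Shannon's entropy as a -> 1, and
# Havrda-Charvat at a = 2 gives the Gini-Simpson index 1 - sum(p_i^2).
import numpy as np

def shannon(p):
    return float(-np.sum(p * np.log(p)))

def renyi_entropy(p, a):
    return float(np.log(np.sum(p**a)) / (1 - a))

def havrda_charvat(p, a):
    return float((1 - np.sum(p**a)) / (a - 1))

p = np.array([0.1, 0.2, 0.3, 0.4])
print(shannon(p))
print(renyi_entropy(p, 1.001), havrda_charvat(p, 1.001))  # both close to Shannon
print(havrda_charvat(p, 2), 1 - np.sum(p**2))             # Gini-Simpson index
```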
1.3 Properties of Information Measures
The measures of divergence are not formal distance functions. Any bivariate function I_X(·, ·) that satisfies the non-negativity property, namely I_X(·, ·) ≥ 0 with equality iff its two arguments are equal, can possibly be used as a measure of information or divergence. The three types of measures of information and divergence share similar statistical properties. Several properties have been investigated, some of axiomatic and others of operational character. Here we briefly mention some of these properties. In what follows we shall use I_X for either I_X(θ1, . . . , θk), k ≥ 1, the information about (θ1, . . . , θk) based on the r.v. X, or I_X(f1, f2), a measure of divergence between f1 and f2.

One of the most distinctive properties is the additivity property. The weak additivity property is defined as

I_{X,Y} = I_X + I_Y, if X is independent of Y,

while strong additivity is defined by

I_{X,Y} = I_X + I_{Y|X},

where I_{Y|X} = E(I_{Y|X=x}) is the conditional information or divergence of Y|X. The sub-additivity and super-additivity properties are defined through weak additivity when the equality is replaced with an inequality:

I_{X,Y} ≤ I_X + I_Y (sub-additivity) and I_{X,Y} ≥ I_X + I_Y (super-additivity).

Observe that super- and sub-additivity are contradictory. Sub-additivity is not satisfied by any known measure except Shannon's entropy [cf. Papaioannou (1985)]. Super-additivity, coupled with equality iff X and Y are independent, is satisfied by Fisher's information number (Fisher's shift-invariant information)
and mutual information [cf. Papaioannou and Ferentinos (2005) and Micheas and Zografos (2006)]. Super-additivity generates measures of dependence or correlation, while sub-additivity stems from the conditional inequality (entropy).

Three important inequality properties are the conditional inequality, given by I_{X|Y} ≤ I_X; the nuisance parameter property, given by I_X(θ1, θ2) ≤ I_X(θ1), where θ1 is the parameter of interest and θ2 a nuisance parameter; and the monotonicity property (maximal information property), given by I_{T(X)} ≤ I_X for any statistic T(X). Note that if T(X) is sufficient then the monotonicity property holds with equality, which shows the invariance of the measure under sufficient transformations.

Let α1 and α2 be positive numbers such that α1 + α2 = 1, and let f1 and f2 be two probability density functions. The convexity property is defined as

I_X(α1 f1 + α2 f2) ≤ α1 I_X(f1) + α2 I_X(f2).

The order preserving property was introduced by Shiva, Ahmed and Georganas (1973) and states that the relation between the amount of information contained in a r.v. X1 and that contained in another r.v. X2 remains intact irrespective of the measure of information used. In particular, if the superscripts 1 and 2 represent two different measures of information, then

$$I^{1}_{X_1} \le I^{1}_{X_2} \;\Rightarrow\; I^{2}_{X_1} \le I^{2}_{X_2}.$$
The limiting property is defined by f_n → f iff I(f_n) → I(f) or I(f_n, f) → 0, where f_n is a sequence of probability density functions, f is the limiting probability density function, and I(f_n) and I(f_n, f) are measures of information based on one or two pdfs, respectively. We finally mention the Ali-Silvey property: if f(x, θ) (or simply f_θ) has a monotone likelihood ratio in x, then

θ1 < θ2 < θ3 implies I_X(f_{θ1}, f_{θ2}) < I_X(f_{θ1}, f_{θ3}).

Other important properties include the loss of information and sufficiency in experiments. For details see Ferentinos and Papaioannou (1981) and Papaioannou (1985).
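As a small numerical illustration of weak additivity (our example, using the discrete Kullback-Leibler divergence, which satisfies it), the sketch below verifies that I_{X,Y} = I_X + I_Y when X and Y are independent under both distributions.

```python
# Minimal sketch: weak additivity of the discrete Kullback-Leibler divergence
# when X and Y are independent under both f1 = pX (x) pY and f2 = qX (x) qY.
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

p_x, q_x = np.array([0.3, 0.7]), np.array([0.5, 0.5])
p_y, q_y = np.array([0.1, 0.6, 0.3]), np.array([0.2, 0.5, 0.3])

joint_p = np.outer(p_x, p_y).ravel()   # independence under f1
joint_q = np.outer(q_x, q_y).ravel()   # independence under f2

print(kl(joint_p, joint_q), kl(p_x, q_x) + kl(p_y, q_y))  # the two values coincide
```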
1.4 Information under Censoring and Truncation
Let X be the variable of interest and Y the censoring variable. We observe (Z, δ), where Z = min(X, Y) and δ = I_{[X≤Y]} is the censoring indicator. The full likelihood of (Z, δ) is

$$L(z,\delta) = \left[f(z,\theta)\,\bar G(z,\theta)\right]^{\delta}\left[g(z,\theta)\,\bar F(z,\theta)\right]^{1-\delta},$$

where f and g are the pdf's of X and Y, F and G the cdf's of X and Y, $\bar G = 1-G$ and $\bar F = 1-F$. The Fisher information about θ contained in (Z, δ) is given by

$$I^F_{(Z,\delta)}(\theta) = E\left(\frac{\partial}{\partial\theta}\log L(Z,\delta)\right)^{2} = \int_{-\infty}^{+\infty}\left(\frac{\partial}{\partial\theta}\log f\bar G\right)^{2} f\bar G\,dz + \int_{-\infty}^{+\infty}\left(\frac{\partial}{\partial\theta}\log g\bar F\right)^{2} g\bar F\,dz.$$

Consider now f1 and f2, two different pdf's for the random variable X, with survival functions $\bar F_1$ and $\bar F_2$. Then Csiszar's φ-divergence between f1 and f2 based on (Z, δ) is defined as

$$I^C_{(Z,\delta)}(f_1,f_2) = \int_{-\infty}^{+\infty} f_2\,\bar G\,\varphi\!\left(\frac{f_1}{f_2}\right) dz + \int_{-\infty}^{+\infty} g\,\bar F_2\,\varphi\!\left(\frac{\bar F_1}{\bar F_2}\right) dz.$$
The basic properties of the above (complete) censored measures of information have been investigated by Tsairidis, Ferentinos and Papaioannou (1996). In random censoring two additional properties, introduced by Hollander et al. (1987, 1990) and called the "acid test", are appropriate. They are the maximal information property, given by

(i) E[Information(X)] ≥ E[Information(Z, δ)] for every X, Y,

and the censoring property, given by

(ii) E[Information(Z1, δ1)] ≥ E[Information(Z2, δ2)] for every X,

where (Zi, δi) is the censored variable associated with the censoring variable Yi, i = 1, 2, and Y1 is stochastically larger than Y2, so that censoring by Y1 is lighter.
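To illustrate the first acid-test property numerically, the following Python sketch (our example; the exponential rates θ = 1 and c = 0.5 are arbitrary choices) estimates the Fisher information about an exponential rate θ under random censoring by an independent exponential variable and compares it with the uncensored value 1/θ².

```python
# Minimal sketch: Fisher information about the rate theta of an exponential
# lifetime X under random censoring by an independent exponential Y with
# rate c.  The score of (Z, delta) is delta/theta - Z, so the censored
# information is 1/(theta*(theta + c)), smaller than the uncensored 1/theta^2.
import numpy as np

rng = np.random.default_rng(1)
theta, c, n = 1.0, 0.5, 500_000

x = rng.exponential(1 / theta, size=n)   # lifetimes
y = rng.exponential(1 / c, size=n)       # censoring times
z = np.minimum(x, y)
delta = (x <= y).astype(float)

score = delta / theta - z                # d/dtheta of the censored log-likelihood
print(np.mean(score**2))                 # Monte Carlo estimate
print(1 / (theta * (theta + c)))         # exact censored information
print(1 / theta**2)                      # information without censoring
```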
1.5 Model Selection Criteria

In this section the BHHJ divergence of Section 1.2.2 is used for model selection. Let g denote the true (unknown) density of the data and let {f_θ} be a candidate parametric family. Dropping the term of the BHHJ divergence that depends only on g, the quantity on which the selection criterion is based is

$$W_\theta = \int f_\theta^{1+a}(z)\,dz - \left(1+\frac{1}{a}\right) E_g\!\left[f_\theta^{a}(Z)\right], \quad a>0. \tag{1.2}$$

1.5.1 The Expected Overall Discrepancy
The target theoretical quantity that needs to be estimated is given by

$$EW_{\hat\theta} = E\left(W_\theta \,\big|\, \theta=\hat\theta\right), \tag{1.3}$$

where θ̂ is any consistent and asymptotically normal estimator of θ. This quantity can be viewed as the average distance between g and f_θ up to a constant and is known as the expected overall discrepancy between g and f_θ. Observe that the expected overall discrepancy can be easily evaluated. More specifically, the derivatives of (1.2), in the case where g belongs to the family {f_θ}, are given by (see Mattheou and Karagrigoriou (2006)):

$$\text{(a)}\quad \frac{\partial W_\theta}{\partial\theta} = (a+1)\left[\int u_\theta(z)\, f_\theta^{1+a}(z)\, dz - E_g\!\left(u_\theta(z)\, f_\theta^{a}(z)\right)\right] = 0,$$

$$\text{(b)}\quad \frac{\partial^2 W_\theta}{\partial\theta^2} = (a+1)\left\{(a+1)\int [u_\theta(z)]^2 f_\theta^{1+a}(z)\, dz - \int i_\theta\, f_\theta^{1+a}(z)\, dz + E_g\!\left(i_\theta(z)\, f_\theta^{a}(z)\right) - a\, E_g\!\left([u_\theta(z)]^2 f_\theta^{a}(z)\right)\right\} = (a+1)J,$$

where $u_\theta = \frac{\partial}{\partial\theta}\log f_\theta$, $i_\theta = -\frac{\partial^2}{\partial\theta^2}\log f_\theta$ and $J = \int [u_\theta(z)]^2 f_\theta^{1+a}(z)\, dz$.

Using a Taylor expansion of W_θ around the true point θ0, and for a p-dimensional parameter θ, we can show that (1.3) at θ = θ̂ takes the form

$$EW_{\hat\theta} = W_{\theta_0} + \frac{(a+1)}{2}\, E\left[\left(\hat\theta - \theta_0\right)' J \left(\hat\theta - \theta_0\right)\right]. \tag{1.4}$$
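As a sanity check of the role of W_θ (an illustration of ours, assuming a N(θ, 1) candidate family, a true density g = N(0, 1) and the arbitrary choice a = 0.25), the sketch below evaluates W_θ of (1.2) by numerical integration and confirms that it is minimized at the true parameter value.

```python
# Minimal sketch: the target quantity W_theta for a N(theta, 1) candidate
# family when the true density g is N(0, 1); W_theta is minimised at theta = 0.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

a = 0.25
g = norm(loc=0.0, scale=1.0).pdf          # true density

def W(theta):
    f = norm(loc=theta, scale=1.0).pdf    # candidate density f_theta
    term1 = quad(lambda z: f(z)**(1 + a), -np.inf, np.inf)[0]
    term2 = quad(lambda z: g(z) * f(z)**a, -np.inf, np.inf)[0]
    return term1 - (1 + 1 / a) * term2

grid = np.linspace(-1.0, 1.0, 21)
values = [W(t) for t in grid]
print(grid[int(np.argmin(values))])       # approximately 0.0
```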
1.5.2 Estimation of the Expected Overall Discrepancy
In this section we construct an unbiased estimator of the expected overall discrepancy (1.4). First though we shall deal with the estimation of the unknown
density g. An estimate of (1.2) with respect to g is given by replacing $E_g\!\left(f_\theta^{a}(Z)\right)$ by its sample analogue:

$$Q_\theta = \int f_\theta^{1+a}(z)\, dz - \left(1+\frac{1}{a}\right)\frac{1}{n}\sum_{i=1}^{n} f_\theta^{a}(X_i), \tag{1.5}$$

with derivatives given by

$$\text{(a)}\quad \frac{\partial Q_\theta}{\partial\theta} = (a+1)\left[\int u_\theta(z)\, f_\theta^{1+a}(z)\, dz - \frac{1}{n}\sum_{i=1}^{n} u_\theta(X_i)\, f_\theta^{a}(X_i)\right], \quad a>0,$$

$$\text{(b)}\quad \frac{\partial^2 Q_\theta}{\partial\theta^2} = (a+1)\left\{(a+1)\int [u_\theta(z)]^2 f_\theta^{1+a}(z)\, dz - \int i_\theta\, f_\theta^{1+a}(z)\, dz + \frac{1}{n}\sum_{i=1}^{n} i_\theta(X_i)\, f_\theta^{a}(X_i) - a\,\frac{1}{n}\sum_{i=1}^{n} [u_\theta(X_i)]^2 f_\theta^{a}(X_i)\right\}.$$
It is easy to see that, by the weak law of large numbers, as n → ∞ we have

$$\frac{\partial Q_\theta}{\partial\theta}\bigg|_{\theta_0} \xrightarrow{P} \frac{\partial W_\theta}{\partial\theta}\bigg|_{\theta_0} \quad\text{and}\quad \frac{\partial^2 Q_\theta}{\partial\theta^2}\bigg|_{\theta_0} \xrightarrow{P} \frac{\partial^2 W_\theta}{\partial\theta^2}\bigg|_{\theta_0}. \tag{1.6}$$

The consistency of θ̂, expressions (1.5) and (1.6), and a Taylor expansion of Q_θ around the point θ̂ can be used to evaluate the expectation of the estimator Q_θ at the true point θ0:

$$EQ_{\theta_0} \equiv E\left(Q_\theta\,\big|\,\theta=\theta_0\right) = EQ_{\hat\theta} + \frac{a+1}{2}\, E\left[\left(\hat\theta-\theta_0\right)' J \left(\hat\theta-\theta_0\right)\right] \equiv W_{\theta_0}.$$

As a result, (1.4) takes the form

$$EW_{\hat\theta} = E\left\{Q_{\hat\theta} + (a+1)\left(\hat\theta-\theta_0\right)' J \left(\hat\theta-\theta_0\right)\right\}.$$

It can be shown that, under normality,

$$J = (2\pi)^{-a/2}\left(\frac{1+a}{1+2a}\right)^{1+\frac{p}{2}}\left[\Sigma^{-\frac{a}{2}}\,\mathrm{Var}\!\left(\hat\theta\right)\right]^{-1},$$

where Σ is the asymptotic variance matrix of θ̂. Taking also into consideration that $\left(\hat\theta-\theta\right)'\left[\Sigma^{-\frac{a}{2}}\,\mathrm{Var}\!\left(\hat\theta\right)\right]^{-1}\left(\hat\theta-\theta\right)$ has approximately a $\chi^2_p$ distribution, the Divergence Information Criterion, defined as the asymptotically unbiased estimator of $EW_{\hat\theta}$, is given by

$$DIC = Q_{\hat\theta} + (a+1)\,(2\pi)^{-a/2}\left(\frac{1+a}{1+2a}\right)^{1+\frac{p}{2}} p.$$
Note that the resulting family of criteria is indexed by the single parameter a, whose value dictates to what extent the estimating method becomes more robust than maximum likelihood. One should be aware of the fact that the larger the value of a, the greater the loss of efficiency. As a result, one should be interested in small positive values of a, say between zero and one. The proposed DIC criterion could be used in applications where outliers or contaminated observations are involved; prior knowledge of contamination may be useful in identifying an appropriate value of a. Preliminary simulations with a 10% contamination proportion show that DIC has a tendency of underestimation, in contrast with AIC, which overestimates the true model.
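A minimal computational sketch of the criterion is given below (our illustration, not the authors' code): it computes DIC for a normal candidate model fitted to a sample, using the maximum likelihood estimates as a stand-in for the minimum divergence estimates and the arbitrary choice a = 0.25.

```python
# Minimal sketch: DIC for a fitted N(mu, sigma^2) model; the parameters are
# estimated by maximum likelihood as a simplification, and p is the number of
# estimated parameters.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def dic_normal(x, a=0.25):
    mu, sigma = np.mean(x), np.std(x, ddof=1)     # stand-in for the minimum divergence fit
    f = norm(loc=mu, scale=sigma).pdf
    p = 2                                         # parameters of N(mu, sigma^2)
    q = quad(lambda z: f(z)**(1 + a), -np.inf, np.inf)[0] \
        - (1 + 1 / a) * np.mean(f(x)**a)          # Q evaluated at the fitted parameters
    penalty = (a + 1) * (2 * np.pi)**(-a / 2) * ((1 + a) / (1 + 2 * a))**(1 + p / 2) * p
    return q + penalty

rng = np.random.default_rng(2)
x = rng.normal(1.0, 2.0, size=300)
print(dic_normal(x))
```

In a model selection exercise one would compute this value for each candidate family and prefer the model with the smallest DIC.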
1.6 Discussion
In this paper we attempted an overview of measures of information and divergence. We discussed several types of measures and several of their most important properties. We also dealt with measures under censoring and truncation, as well as with weighted distributions and order statistics. Finally, we presented results related to the use of measures of divergence in model selection criteria and introduced the new Divergence Information Criterion.

The measures of information and divergence have recently attracted renewed interest from the scientific community, primarily due to their use in several contexts, such as communication theory and new sampling situations. As a result, statisticians need to refocus on these measures and explore further their theoretical characteristics as well as their practical implications, which constitute the main contributions to the field.
Acknowledgement: The second author would like to thank Professors Ferentinos and Zografos and Dr. Tsairidis for their longtime research collaboration, which led to the development of the ideas mentioned in the first part of the paper.
References

1. Aczel, J. (1986). Characterizing information measures: Approaching the end of an era. In Uncertainty in Knowledge-Based Systems, Lecture Notes in Computer Science (B. Bouchon and R.R. Yager, eds.), Springer-Verlag, New York, 359-384.

2. Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. Proc. of the 2nd Intern. Symposium on Information Theory (Petrov, B. N. and Csaki, F., eds.), Akademiai Kiado, Budapest.
3. Basu, A., Harris, I. R., Hjort, N. L. and Jones, M. C. (1998). Robust and efficient estimation by minimising a density power divergence. Biometrika, 85, 549-559.

4. Bayarri, M. J., DeGroot, M. H. and Goel, P. K. (1989). Truncation, information and the coefficient of variation. In Contributions to Probability and Statistics: Essays in Honor of Ingram Olkin (Gleser, L., Perlman, M., Press, S. J. and Sampson, A., eds.), Springer-Verlag, New York, 412-428.

5. Burbea, J. and Rao, C. R. (1982). On the convexity of some divergence measures based on entropy functions. IEEE Trans. on Inform. Theory, IT-28, 489-495.

6. Chandrasekar, B. and Balakrishnan, N. (2002). On a multiparameter version of Tukey's linear sensitivity measure and its properties. Ann. Inst. Statist. Math., 54, 796-805.

7. Cressie, N. and Read, T. R. C. (1984). Multinomial goodness-of-fit tests. J. R. Statist. Soc. B, 46, 440-464.

8. Csiszar, I. (1963). Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten. Magyar Tud. Akad. Mat. Kutato Int. Kozl., 8, 85-108.

9. Csiszar, I. (1977). Information measures: A critical review. Trans. 7th Prague Conf. (1974), Academia, Prague, 73-86.
10. Ferentinos, K. and Papaioannou, T. (1981). New parametric measures of information. Information and Control, 51, 193-208.

11. Havrda, J. and Charvat, F. (1967). Quantification method of classification processes: Concept of structural a-entropy. Kybernetika, 3, 30-35.

12. Hollander, M., Proschan, F. and Sconing, J. (1987). Measuring information in right-censored models. Naval Research Logistics, 34, 669-681.

13. Hollander, M., Proschan, F. and Sconing, J. (1990). Information, censoring and dependence. Institute of Mathematical Statistics, Lecture Notes Monograph Series, Topics in Statistical Dependence, 16, 257-268.

14. Iyengar, S., Kvam, P. and Singh, H. (1999). Fisher information in weighted distributions. The Canadian Journal of Statistics, 27, 833-841.

15. Kagan, A. M. (1963). On the theory of Fisher's amount of information (in Russian). Doklady Akademii Nauk SSSR, 151, 277-278.

16. Kapur, J.N. (1984). A comparative assessment of various measures of divergence. Advances in Management Studies, 3, 1-16.
17. Kendall, M.G. (1973). Entropy, probability and information. International Statistical Review, 41, 59-68.

18. Mattheou, K. and Karagrigoriou, A. (2006). A discrepancy based model selection criterion. Proc. 18th Conf. Greek Statist. Society, 485-494.

19. Micheas, A.C. and Zografos, K. (2006). Measuring stochastic dependence using φ-divergence. J. of Multiv. Anal., 97, 765-784.

20. Nelson, W. (1982). Applied Life Data Analysis. Wiley, New York.

21. Papaioannou, T. (1985). Measures of information. In Encyclopedia of Statistical Sciences (Kotz, S. and Johnson, N. L., eds.), Wiley, 5, 391-397.

22. Papaioannou, T. (2001). On distances and measures of information: a case of diversity. In Probability and Statistical Models with Applications (C.A. Charalambides, M.V. Koutras and N. Balakrishnan, eds.), Chapman and Hall/CRC, London, 503-515.

23. Papaioannou, T. and Ferentinos, K. (2005). On two forms of Fisher's measure of information. Comm. in Statist. - Theory and Methods, 34, 1461-1470.

24. Papaioannou, T., Ferentinos, K. and Tsairidis, Ch. (2006). Some information theoretic ideas useful in statistical inference. Methodology and Computing in Applied Probability (to appear).

25. Rathie, P.N. and Kannappan, P. (1972). A directed-divergence function of type β. Information and Control, 20, 38-45.

26. Renyi, A. (1961). On measures of entropy and information. Proc. 4th Berkeley Symp. on Math. Statist. Prob., 1, 547-561, Univ. of California Press.

27. Shiva, S., Ahmed, N. and Georganas, N. (1973). Order preserving measures of information. J. Appl. Prob., 10, 666-670.

28. Soofi, E.S. (1994). Capturing the intangible concept of information. JASA, 89, 1243-1254.

29. Soofi, E.S. (2000). Principal information theoretic approaches. JASA, 95, 1349-1353.

30. Tsairidis, Ch., Ferentinos, K. and Papaioannou, T. (1996). Information and random censoring. Information Sciences, 92, 159-174.
31. Tsairidis, Ch., Zografos, K., Ferentinos, K. and Papaioannou, T. (2001). Information in quantal response data and random censoring. Ann. Inst. Statist. Math., 53, 528-542.

32. Tukey, J. W. (1965). Which part of the sample contains the information? Proc. Nat. Acad. Sci. USA, 53, 127-134.

33. Vajda, I. (1973). χ²-divergence and generalized Fisher's information. Trans. 6th Prague Conf. (1971), Academia, Prague, 873-886.

34. Zheng, G. and Gastwirth, J. L. (2000). Where is the Fisher information in an ordered sample? Statistica Sinica, 10, 1267-1280.

35. Zheng, G. and Gastwirth, J. L. (2002). Do tails of symmetric distributions contain more Fisher information about the scale parameter? Sankhya, B 64, 289-300.