The Importance of Being Ignorant
Using Entropy for Interpretation and Inference

R Nityananda
Rajaram Nityananda works in areas of physics and astronomy involving optical and statistical concepts. After increasing entropy at the Raman Research Institute in Bangalore for more than two decades, he moved recently to the National Centre for Radio Astrophysics of the Tata Institute of Fundamental Research at Pune.
In many real-life situations, we have to draw conclusions from data which are not complete and have been affected by measurement errors. Such problems have been addressed from the time of Bayes and Laplace (late 1700's) using concepts which parallel Boltzmann's use of entropy in thermal physics. The idea is to assign probabilities to different possible conclusions from a given set of data. A critical (and sometimes controversial) input is a 'prior probability', which represents our knowledge before any data are given or taken! This body of ideas is introduced in this article with simple examples.

From the earliest times, thinkers have recognised two distinct ways of learning about the world we live in. Our educational system gives prominence to the first one, 'deduction'. The best example is of course Euclid's construction of geometry from a few innocent-looking axioms. In the world of fiction, Sherlock Holmes claimed to 'deduce' what had really happened in a crime from a few clues. But in reality, what most of us (Sherlock Holmes included) practise should be called 'induction'. Logicians have given this name to drawing conclusions from observations or experiments by a rather different process. To start with, we have a large number of possible hypotheses to choose from. Observations and experimental data are used to narrow down the possibilities. The word 'hypothesis' is being used in a rather simple sense here. For example, if we are trying to determine the elliptical orbit of an asteroid, the 'hypothesis' is just a set of numbers giving the plane of the orbit, the size and shape and orientation of the ellipse in this plane, and where the asteroid sits on the ellipse at a given time.
Figure 1. The (unknown) orbit of the asteroid is shown by dashed lines. At three different times t1, t2, t3, observations give the three directions (but not distances) of E1 A1, E2 A2, and E3 A3. The earth’s orbit E1 E2 E3 is assumed known.
We do not directly measure these numbers but rather the angular position in the sky at different times, as seen from the earth, which is itself a moving platform. The situation is illustrated in Figure 1.

Gauss faced precisely this problem of orbit determination in the year 1801. A few observations of Ceres, the very first asteroid discovered, were available. He invented the so-called 'method of least squares' to choose the best orbit consistent with the measurements available. We now explain how his method fits in with our earlier general discussion. To simplify matters, we will assume, as in Figure 1, that the two orbits, of earth and asteroid, lie in a plane. We show in Figure 2 two kinds of graphs. One, made up of individual points, gives the observations. The continuous curves give the predictions of different possible orbits (i.e., hypotheses).

Our first reaction is that it needs only four numbers to specify the orbit in the plane. These could be the x and y coordinates of the asteroid, and the x and y components of its velocity, at a given time (January 1, 1801, for example!). Four measurements ought to be enough, and we should be able to deduce the orbit without guesswork.
Figure 2. The points show the actual observations. The continuous curve A shows what we might regard as the best orbit. B is another orbit at a greater distance than A. The vertical lines through the observed points represent errors of measurement.
But Gauss, although the prince of mathematicians, also knew the real world better. Measurements are never exact, and the points would not lie exactly on the predicted curve even if we knew the orbit! We can state this in another way. For each measurement, we can draw a vertical bar which represents the possible range in which the true value (of the angle) could lie. Each point has now become 'fuzzy' or 'blurred' in the vertical direction. (The measurement along the x-axis, viz. time, is usually very accurate and we do not worry about its errors here.)
Now we can readily see that there is a corresponding fuzziness or uncertainty in the curve drawn through the points. We have moved from deduction to induction. Other names for this process are 'inversion' (going back from the data to the hypothesis) and 'statistical inference'. Going back to Figure 2, why do we choose the curve A rather than the curve B? An experimenter would say that 'the deviations of curve A from the measurements are consistent with the error bars, while curve B lies well outside the error bars'.
Now let us try and be more quantitative. Each error bar is really not a line with sharp limits. Larger errors are less probable, but not impossible. In fact, Gauss himself, building on the work of de Moivre and Laplace, proposed that the probability for the error to be x falls off proportionally to exp(-x^2/2σ^2). This is the bell-shaped graph sketched in Figure A. Box 1 gives a few more details about this remarkable, widespread distribution which we all call gaussian. The basic message of Box 1 is that the error is itself the sum of many smaller contributions, each of which may not have a gaussian distribution.

Box 1. The Gaussian Distribution

A coin is tossed eight times. What is the most probable number of heads? Four, of course. Why is eight heads less probable than four? Because there is only one way to get eight heads, HHHHHHHH. But there are 8C4 = 70 ways to get four heads, since we now have freedom to choose any four of the eight tosses to show heads. The full table of numbers is

No. of heads   0   1   2   3   4   5   6   7   8
No. of cases   1   8  28  56  70  56  28   8   1
and they are plotted in Figure A. To get the probability, we have to divide by 256. We have also superposed a bell-shaped curve. This is how the probability for n heads behaves when the number of tosses is very large (of course, we have to relabel the axes if we have 158 tosses instead of 8!). This is the famous gaussian distribution. Its mathematical form is

P(x) = A exp(-(x - m)^2/2σ^2).

A is a constant of proportionality.

Figure A.

x = m is the peak of the curve and also the average value of x. σ^2 is a constant which is called the 'variance'. It measures the average of the square of the deviation of x from m.
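To see the convergence described in Box 1 in action, a reader can run a short script. The following Python sketch (ours, not part of the original article) counts the exact number of ways of getting n heads and compares the resulting probability with the gaussian approximation; the choice of 8 and 100 tosses is arbitrary.

```python
import math

def binomial_counts(N):
    """Number of ways of getting n heads in N tosses, for n = 0..N."""
    return [math.comb(N, n) for n in range(N + 1)]

def gaussian_approx(N, n):
    """Gaussian approximation to the probability of n heads in N tosses.
    For a fair coin the mean is m = N/2 and the variance is N/4."""
    m, var = N / 2.0, N / 4.0
    return math.exp(-(n - m) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

for N in (8, 100):
    counts = binomial_counts(N)
    total = 2 ** N
    # Compare the exact probability with the gaussian near the peak.
    for n in range(N // 2 - 2, N // 2 + 3):
        exact = counts[n] / total
        print(N, n, round(exact, 4), round(gaussian_approx(N, n), 4))
```

Already at eight tosses the two numbers agree to a few per cent near the peak, and at a hundred tosses they are nearly indistinguishable.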
But the sum does approach this law in many cases. We can think of the height of the gaussian as measuring the number of ways that a given error could be built up from the underlying individual contributions ('errorlets'?). The logarithm of this number is, therefore, proportional to

log(A exp(-x_i^2/2σ_i^2)) = constant - x_i^2/2σ_i^2,

where x_i is the error at the i-th measured point and A is the constant in front of the gaussian.
σ_i^2 is the average of the square of x_i. Why do we take the logarithm? This is a convenient thing to do when we want to multiply numbers! Come back to our original problem of determining the best orbit (Figure 2). When we guess a given curve, A or B, we are automatically attributing the deviations of the points from the curve to experimental error. So we should be asking ourselves: 'What is the probability that the errors took the values that we are suggesting?' This probability is obtained by multiplying gaussian functions for the individual errors at each measured point. We now want to maximise the joint probability, i.e., the product of probabilities. So we maximise the logarithm, which is

log(Probability of errors) = constant - Σ_i x_i^2/2σ_i^2.
In the simple case where all the σ_i's are equal, this means we have to minimise the sum of the squares of all the errors (because of the negative sign in front of it). This is the famous method of least squares, and it is eminently sensible. It prevents us from doing silly things like drawing the theoretical graph well away from the points. It ensures that errors have both signs. But let us remember that least squares is not sacred or perfect. It is only as good as the assumptions that went into it.
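As an aside (not in the original article), the recipe 'maximise the logarithm of the product of gaussians, i.e., minimise the weighted sum of squared errors' takes only a few lines of code. The sketch below fits a straight line rather than an orbit, purely to keep the model simple; the data values, error bars and parameter grid are invented for illustration.

```python
import numpy as np

# Invented observations: times t, measured values y, and error bars sigma.
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
sigma = np.array([0.3, 0.3, 0.4, 0.3, 0.5])

def chi_square(params, t, y, sigma):
    """Minus twice the log-likelihood (up to a constant) for a
    straight-line model y = a + b*t with gaussian errors."""
    a, b = params
    residuals = (y - (a + b * t)) / sigma
    return np.sum(residuals ** 2)

# Brute-force search over a grid of hypotheses (a, b); the hypothesis
# with the smallest weighted sum of squares is the least-squares 'best orbit'.
a_grid = np.linspace(-2, 4, 301)
b_grid = np.linspace(0, 4, 301)
best = min((chi_square((a, b), t, y, sigma), a, b)
           for a in a_grid for b in b_grid)
print("best fit: a = %.2f, b = %.2f, chi-square = %.2f" % (best[1], best[2], best[0]))
```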
When the errors do not have a gaussian distribution, or when we have some physical limits which restrict our orbit, we can, and must, do better. Our example was really meant to introduce a broader framework for hypothesis testing. This broader framework came even before Gauss. It is attributed to Bayes and Laplace, who worked in the late 1700's.¹ The basic ('Bayesic'?) idea is to use a simple theorem of conditional probability due to Bayes (Box 2). We need it in the form given below.
¹ We should warn the reader that there are many other approaches to statistical inference. The Bayesian approach of this article uses concepts closest to entropy.
Box 2. Bayes' Theorem for Conditional Probabilities

One way of understanding this theorem is via Figure B, in which points stand for events and areas stand for probabilities. The horizontally striped region A represents all cases or trials in which some event a occurred. The vertically striped region B similarly stands for all instances of b. The intersection C of these two regions is cross-hatched and represents cases where both a and b occurred. We can now say

Area of C = p(a,b) = joint probability of a and b
Area of A = p(a) = probability of a
Figure B. Venn diagram illustrating Bayes’ theorem.
Area of B = p(b) = probability of b

The conditional probability of a, given that b has occurred, is

p(a|b) = (area of C)/(area of B) = p(a,b)/p(b).

Hence, p(a,b) = p(a|b) p(b). Similarly, p(a,b) = p(b|a) p(a). Equating these two,

p(a|b) = p(b|a) p(a)/p(b).

Since the left side is a function of a for fixed b, we can treat the denominator as a constant, as we have in the main text.
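Box 2 can also be checked numerically. The short simulation below (ours, not from the article) counts events over many trials of a die throw and verifies that p(a|b) = p(b|a) p(a)/p(b); the particular events a and b are arbitrary choices.

```python
import random

random.seed(1)
trials = 200_000

# Two events defined on a single die throw: a = 'number is even', b = 'number >= 3'.
count_a = count_b = count_ab = 0
for _ in range(trials):
    roll = random.randint(1, 6)
    a = (roll % 2 == 0)
    b = (roll >= 3)
    count_a += a
    count_b += b
    count_ab += (a and b)

p_a = count_a / trials
p_b = count_b / trials

# Conditional probabilities estimated directly from the counts.
p_a_given_b = count_ab / count_b
p_b_given_a = count_ab / count_a

# Bayes' theorem: p(a|b) should equal p(b|a) * p(a) / p(b).
print(round(p_a_given_b, 3), round(p_b_given_a * p_a / p_b, 3))
```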
Probability of H (given D) ∝ Probability of D (given H) multiplied by Probability of H (not given anything).

Our choice of notation is deliberate. H stands for hypothesis, D stands for data. The left hand side states the goal of all experimental science, viz., we are given data, and we try to assign probabilities to different hypotheses based on this data. That is what the notation P(H|D) means. The right hand side of our equation tells us how we are to achieve this goal. It has two factors. The first one is the conditional probability P(D|H). In words, given a hypothesis (an orbit in our earlier example), what is the probability that the given data could arise (e.g., angle measurements of the asteroid)? We have already talked about this when we multiplied gaussian (probability) distributions for the errors at the various experimental points. In general, if we know how to predict with our hypothesis and we understand our experimental errors, we should have no difficulty with P(D|H). (And if we don't, the first priority is to do so!) Our earlier discussion stopped at P(D|H), which statisticians call the 'likelihood function' when regarded as a function of H, for fixed D. Of course, it is an honest probability distribution for D, when H is fixed.

But the rules of probability tell us that this is not enough. We have to face up squarely to the second factor on the right side, P(H). This is the unconditional probability that a particular hypothesis H is true. Since this has nothing to do with the data, it is called the 'prior' distribution. Perhaps the philosophy of Kant shaped this terminology. He believed that some things like space and time had to be given to us 'a priori', right at the beginning. We already had some kind of prior distribution in mind in our orbit problem. We only drew curves like A or B which were based on Newton's laws of motion and gravitation, and did not try others. Using P(H) to reject what we know to be impossible even before the observations are taken is a good idea.
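The rule 'posterior proportional to likelihood times prior' is equally easy to put into code. The sketch below (again ours, not the author's) scores a grid of hypotheses for the same invented straight-line data used in the earlier least-squares sketch, with a flat prior; the grid ranges and data are illustrative assumptions.

```python
import numpy as np

# Same invented data as in the earlier least-squares sketch.
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
sigma = np.array([0.3, 0.3, 0.4, 0.3, 0.5])

# Hypotheses: a grid of (a, b) values for the model y = a + b*t.
a_grid = np.linspace(-2, 4, 121)
b_grid = np.linspace(0, 4, 121)
A, B = np.meshgrid(a_grid, b_grid, indexing="ij")

# Likelihood P(D|H): product of gaussians for the individual errors.
chi2 = np.zeros_like(A)
for ti, yi, si in zip(t, y, sigma):
    chi2 += ((yi - (A + B * ti)) / si) ** 2
likelihood = np.exp(-0.5 * chi2)

# Prior P(H): flat over the grid (Laplace's 'insufficient reason').
prior = np.ones_like(A)

# Posterior P(H|D) proportional to P(D|H) P(H), normalised to sum to one.
posterior = likelihood * prior
posterior /= posterior.sum()

i, j = np.unravel_index(np.argmax(posterior), posterior.shape)
print("most probable hypothesis: a = %.2f, b = %.2f" % (a_grid[i], b_grid[j]))
```

With a flat prior the most probable hypothesis coincides with the least-squares answer; a non-flat prior would shift it, which is exactly the point debated below.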
But P(H) also assigns different weights to two hypotheses which are both possible to start with. This seems like introducing the experimenter's prejudice into the interpretation of data! Hot debates continue on this point. The ghost of the prior has haunted Bayesian statistical inference from its birth. Laplace himself coined a 'principle of insufficient reason'. It was a way of making the prior a constant or flat function, so as to be as even-handed as possible. This is similar in spirit to our accepting 1/2 and 1/6 as the probabilities for coins and dice. But when we come to a continuous variable q, going from zero to one, do we say it has equal probabilities to be less than or greater than 1/2? There is a trap here, pointed out by Laplace's countryman Bertrand. Why not apply the same (insufficient!) reasoning to q^2, which also goes from zero to one? We would then conclude that q would be less than 0.707 with probability 1/2. Clearly one needs further input to decide on a prior in cases like this.

So far, we have just touched the fringes of the entropy concept, when we looked at the logarithm of the number of ways that a given error could occur. But we are now prepared for the basic problem which faced Boltzmann when he investigated the theory of gases in the latter half of the nineteenth century. There is a detailed discussion in the article by Bhattacharjee in this issue, and we only give the bare minimum needed for this story. Boltzmann would take the total energy and total volume of a gas as given; these correspond to the data set D. Let us think of the detailed position (x) and velocity (v) information of all the molecules as our hypothesis H. Boltzmann (and his great American contemporary, Gibbs) divided the space of x and v into cells of equal volume, measured by the product dx dv. Notice that he singled out x rather than x^3, v rather than v^5. This prior was based on his analysis of the dynamics of collisions between molecules. The rest is history.
² Interestingly, this is a product of three gaussian distributions for vx, vy, vz.
³ If all the letters were equally likely to occur, this number would be W = 26^100.
He was able to deduce Maxwell's probability distribution law for the molecular velocity components vx, vy, vz.² Even better, he was able to show how collisions would produce such a distribution even if it was not present to start with. These results were in full agreement with experiments, both earlier and later. In modern quantum language, one can state Boltzmann's prior in a physically appealing way: every single energy level of the whole system gets equal probability to start with. While Boltzmann chose the volume in x-v space based on classical collisions, today we know that this is equivalent to counting energy states in quantum theory.

We now move forward about half a century to 1948. Stimulated by rapid advances in electronics, one of the best telephone systems in the world was established in the United States. Many of the new developments came from the Bell Telephone Laboratories (Bell Labs for short) and were published in the Bell System Technical Journal. Claude E Shannon, a young researcher at Bell, contributed two papers on the Mathematical Theory of Communication. His deep insight was to introduce a quantitative measure of the amount of information being communicated. After all, this information is what we really pay the telephone company for! If you receive a message from someone in the English language, you already know the approximate fraction of E's, T's, A's, etc. Let us say there are a hundred letters in a telegram. There is a large number, W, of possible English messages with a hundred letters.³ You open the telegram and find out which one of the W is your message. Shannon proposed that the information gained be measured by S = log2 W. The reason for taking the logarithm is the same as earlier. Two successive telegrams (on unrelated subjects!) would correspond to W1 × W2 possible messages. Shannon's measure ensures that the information (and perhaps your telegram bill!) is additive, i.e., S = log2 W1 + log2 W2 = S1 + S2.
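A small calculation (ours, for illustration) shows why the logarithm makes information additive. It uses the crude equal-probability model of footnote 3, so the numbers themselves are only indicative.

```python
import math

def information_bits(num_possible_messages):
    """Shannon's measure: S = log2(W) bits, where W is the number of
    equally likely messages that could have been sent."""
    return math.log2(num_possible_messages)

# Footnote 3's crude model: 26 equally likely letters, 100-letter telegram.
W1 = 26 ** 100
W2 = 26 ** 100

S1 = information_bits(W1)
S2 = information_bits(W2)
S_joint = information_bits(W1 * W2)   # two unrelated telegrams together

print(round(S1, 1), round(S2, 1), round(S_joint, 1))  # S_joint equals S1 + S2
print("one bit:", information_bits(2))  # two equiprobable options: one bit
```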
This is related to Boltzmann's entropy. He would call S a measure of your ignorance before you opened the telegram, rather than your enlightenment after you opened it. But it is sensible to take the two quantities as equivalent.

Why choose 2 as the base of logarithms? The simplest situations are when the message simply says which of two (equiprobable) options was realised. When the nurse steps out of the maternity ward and tells the anxious father 'It's a girl', W = 2 and S = 1. This is called one bit of information. Everyone in this computer age knows that 'bit' is short for 'binary digit', something which takes two values, zero and one.

Shannon's concept of information took the world by storm. There was tremendous enthusiasm to apply it to every field. An indignant journal editor even wrote the following lines: "We will no longer consider papers with titles like information theory, photosynthesis, and religion"!

We have presented Shannon's work in conjunction with the ideas of Bayes and Boltzmann. This attempt at complete synthesis actually came a few years after Shannon, in the influential work of the physicist Edwin T Jaynes. He and his followers have explored the application of 'maximum entropy' (as they call this approach) to a variety of practical problems.
Figure 3. The photograph... shows the effect of applying maximum entropy deconvolution to a motion-blurred picture. Processing by A Lehar and Maximum Entropy Data Consultants Ltd. for the UK Home Office. Our thanks to Steve Gull and his colleagues at the Mullard Radio Astronomy Observatories in Cambridge who were instrumental in developing maximum entropy methods for such problems.
Figure 4.
Both Shannon and Jaynes died recently, living to see their ideas bear fruit over nearly half a century. Although we cannot give details here, inversion based on maximum entropy methods is in wide use. A dramatic real-life example (Figure 3) would be a blurred photograph of a car. In this context, the blurred photograph stands for the data D, while the reconstructed picture corresponds to H. After inversion, one is able to read the number plate clearly! An example of removing blurring in astronomy using prior information is given in Figure 4.
Address for Correspondence R Nityananda National Centre for Radio Astrophysics Post Bag 3, Ganeshkhind Pune 411 007, India
We must of course remember that maximum entropy is not a magic wand. The fact that we are able to read the number plate means that the information in the data (the blurred photograph), plus the information in the prior, were enough to recover what we were looking for. In a given problem, there is usually a range of possible priors which would be regarded as reasonable. Most workers would regard results which are insensitive to choices in this range as genuine. When results start becoming sensitive to the prior, it is time to go out and get more data, or work on a different problem.
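To close, here is a minimal sketch (ours, and not the method actually used for Figures 3 and 4) of one common textbook formulation of maximum-entropy inversion: choose the non-negative image f that minimises chi-square minus a multiple of the entropy S = -Σ f log f. The 1-D 'image', blurring kernel, noise level and weight alpha are all invented for illustration.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Invented 1-D 'truth': two bright points on a faint background.
n = 40
truth = np.full(n, 0.1)
truth[12] = 5.0
truth[27] = 3.0

# Known blurring kernel (the 'point spread function') and noisy data D.
kernel = np.array([0.25, 0.5, 0.25])
def blur(f):
    return np.convolve(f, kernel, mode="same")

sigma = 0.05
data = blur(truth) + rng.normal(0.0, sigma, n)

# Objective: chi-square/2 minus alpha times the entropy S = -sum(f log f).
alpha = 0.01
def objective(f):
    chi2 = np.sum((blur(f) - data) ** 2 / sigma ** 2)
    entropy = -np.sum(f * np.log(f))
    return 0.5 * chi2 - alpha * entropy

# Positivity is part of the prior: an image cannot be negative.
start = np.full(n, max(float(data.mean()), 0.1))
result = minimize(objective, start, bounds=[(1e-6, None)] * n)
restored = result.x
print("brightest restored pixels:", np.argsort(restored)[-2:])
```

Changing alpha trades off fidelity to the data against the entropy of the reconstruction, which is exactly the sensitivity-to-the-prior issue discussed above.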