Cognitive Foundations of Inductive Inference and Probability: An Axiomatic Approach*

Itzhak Gilboa† and David Schmeidler‡

March 2000
Abstract

We suggest an axiomatic approach to the way in which past cases, or observations, are or should be used for making predictions and for learning. In our model, a predictor is asked to rank eventualities based on possible memories. A "memory" consists of repetitions of past cases, and can be identified with a vector attaching a nonnegative integer (number of occurrences) to each case. Mild consistency requirements on these rankings imply that they have a numerical representation that is linear in the number of case repetitions. That is, there exists a matrix assigning numbers to eventuality-case pairs, such that, for every memory vector, multiplication of the matrix by the vector yields a numerical representation of the ordinal plausibility ranking given that memory. Interpreting this result for the ranking of theories or hypotheses, rather than of specific eventualities, it is shown that one may ascribe to the predictor subjective conditional probabilities of cases given theories, such that her rankings of theories agree with their likelihood functions. As opposed to standard approaches, in our model there
* We thank Didier Dubois and Peter Wakker for conversations that motivated this work; Daniel Lehman, Ariel Rubinstein, and Peyton Young for specific comments and examples; and Edi Karni and Gerda Kessler for many detailed comments. This paper contains most of the material in two earlier papers, "Inductive Inference: An Axiomatic Approach" and "Cognitive Foundations of Probability". This material is also presented as a Web paper at http://www.tau.ac.il/~igilboa/Inductive Inference/Index.html.
† Tel-Aviv University. [email protected]
‡ Tel-Aviv University and The Ohio State University. [email protected]
is a complete dichotomy between the set of objects that are ranked (eventualities or theories) and the space in which mathematical properties, such as convexity or continuity, are defined and used (the space of memories). This dichotomy allows many additional interpretations and applications, among them derivations of subjective probability and of frequentism, a derivation of utility, and derivations of case-based decision theory and of expected utility theory.
1 Introduction
MOTIVATION

Prediction is a fundamental cognitive function. It is one of the hallmarks of intelligence. It is essential to reasoning and to decision making. Some view prediction as the main goal of science. How do and how should we generate predictions?

People often resort to statistical inference. Indeed, statistics offers two main methodologies for prediction. The first, which is typically associated with classical statistics, is frequentist. It suggests that probabilistic prediction be based on observed empirical frequencies. The second, closely related to Bayesian statistics, argues for the formation of a prior probability, to be updated in light of observations by Bayes' rule.

Both approaches have certain drawbacks, as normative as well as descriptive theories. The frequentist approach limits itself to situations that are repeated in seemingly identical conditions. Further, it does not incorporate subjective beliefs, intuition, and the like. The Bayesian approach applies to all conceivable prediction problems, but it says very little about the way in which one can form a prior. The contributions of Ramsey (1931), de Finetti (1937), and Savage (1954) argue that Bayesian expected utility maximization is the only normatively acceptable decision rule, and that in-principle-observable preferences can uniquely define a prior. Probabilities have also been axiomatically derived from qualitative plausibility judgments, where the latter are modeled as binary relations (see Kraft, Pratt, and Seidenberg (1959), Krantz, Luce, Suppes, and Tversky
(1971), Fine (1973), Fishburn (1986)) or as propositions (see Fagin, Halpern, and Megiddo (1990), Fagin and Halpern (1994), Aumann (1995), Heifetz and Mongin (1999)).¹ These axiomatic derivations do not explicitly model the information based on which a prior is formed. Further, they do not attempt to provide an account of this cognitive process. Thus, the axiomatizations of Bayesian beliefs and of Bayesian expected utility maximization may convince you that you would like to be Bayesian, but they do not provide the self-help tools that are needed to become a practising Bayesian.

The goal of this paper is to model explicitly the link between factual knowledge and derived beliefs. Alternatively, from the viewpoint of the frequentist approach, our aim is to generalize the notion of empirical frequencies to situations that are not repeated under precisely the same conditions. A few examples may help to illustrate our goals.

Example 1: A die is rolled over and over again. As far as the predictor can tell, or to the best of her knowledge, all rolls were made under identical conditions. Also, the predictor does not know of any reason to consider any outcome more likely than any other. The frequentist approach seems reasonable in this case. Moreover, the Bayesian approach, starting out with a positive probability for each outcome and assuming i.i.d. rolls, will converge to the empirical frequencies as well. For this case there does not seem to be a need for any new theory. We will only use it as a benchmark, and expect a theory of prediction to coincide with the frequentist approach in this case.

Example 2: The Little Prince travels in space and reaches a new planet. This is his first stop since he left his own planet and his rose. He may wonder whom he will encounter on this planet, whether it will be a fox or a king (assuming he can conceive of these), or something completely different. He has no knowledge on which to base his beliefs.
The frequentist approach remains silent on this issue. The Bayesian approach shames the Little Prince into forming a prior, but gives him no indication as to how to do it. As a test of consistency, one may expect the Little Prince to argue that meeting a king is just as likely as meeting a fox. But we do not expect a theory of prediction to say much that is enlightening about this case of complete ignorance.

Example 3: A physician is asked by a patient whether she predicts that surgery will succeed in his case. The physician knows whether the procedure succeeded in most cases in the past, but she will be quick to remind her patient that every human body is unique. Indeed, the physician knows that the statistics she read included patients who varied in terms of age, gender, medical condition, and so forth. It would therefore be too naive of her to quote statistics as if the empirical frequencies were all that mattered. On the other hand, if the physician considers only past cases of patients that are identical to hers, she will probably end up with an empty database.

Example 4: An expert on international relations is asked to predict the outcome of the conflict in Kosovo. She is expected to draw on her vast knowledge of past cases, coupled with her astute analysis thereof, in forming her prediction. Yet she is not expected to simply compute how often military conflicts were resolved one way or the other.

Examples 3 and 4 present intermediate cases between repeated situations and complete ignorance. In both examples, available databases provide relevant information, but not all pieces of information are relevant to the same degree. We seek a theory of prediction that will make use of the available information, but will allow different past cases to have differential relevance to the prediction problem.

¹ The idea of postulating qualitative probabilities as a basis for numeric probabilities originates with de Finetti, and constitutes a main step in Savage's derivation of subjective probabilities.
THE PREDICTION RULE

Consider the following prediction rule, say, for Example 3. The physician considers all known cases of successful surgery. She uses her subjective judgment to evaluate the similarity of each of these cases to the patient she is treating, and she adds up these similarity values. She then does the same for unsuccessful treatments. It seems reasonable that the outcome with the larger aggregate similarity value will be her prediction.
Similarly, the expert in Example 4 may base her prediction on all known cases, using her expertise in two ways: (i) in recalling past cases; (ii) in assessing their similarities to the situation at hand. Observe that, in the special case in which all past cases are deemed equally relevant to the present case, this prediction rule coincides with a frequentist prediction, that is, with the mode of a frequentist probability. On the other hand, if the database of past cases is empty, this rule does not differentiate between different eventualities. This rule allows subjective judgment in assessing similarities of cases. Yet, it does not insist that subjective judgments be specific enough to form a Bayesian prior in the absence of information.
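The aggregate-similarity rule can be sketched in code. This is only an illustration: the case descriptions, the toy data, and the similarity function `sim` are hypothetical stand-ins for the predictor's subjective judgment, not part of the model itself.

```python
# A minimal sketch of the aggregate-similarity prediction rule for Example 3.

def predict(target, cases, similarity):
    """Rank eventualities by the total similarity of supporting cases.

    cases: list of (description, outcome) pairs, e.g. past patients
    similarity: subjective function mapping (target, description) -> float
    """
    support = {}
    for description, outcome in cases:
        support[outcome] = support.get(outcome, 0.0) + similarity(target, description)
    # The eventuality with the largest aggregate similarity is the prediction.
    return max(support, key=support.get), support

# Invented toy data: past surgeries described by (age, condition) and outcome.
past = [((45, "mild"), "success"), ((50, "mild"), "success"),
        ((78, "severe"), "failure"), ((47, "mild"), "success")]

def sim(target, desc):
    # Closer ages and a matching condition make a case more relevant.
    age, cond = target
    return 1.0 / (1.0 + abs(age - desc[0])) + (1.0 if cond == desc[1] else 0.0)

best, scores = predict((48, "mild"), past, sim)  # best == "success"
```

When `sim` returns 1 for every case, the aggregate score of an outcome is just its frequency count, recovering the frequentist benchmark mentioned above.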
AXIOMATIZATION

The first goal of this paper is to axiomatize this rule. We assume that a predictor has a ranking of possible eventualities given any possible memory (or database). A memory consists of a finite set of past cases, or stories. These are not assumed to have any particular structure. However, we do assume that any case, such as "the die came out on 4", can appear in memory any number of times.

Our main assumption is that prediction satisfies a combination axiom. Roughly, it states that if an eventuality a is more likely than an eventuality b given each of two possible disjoint memories, then a is more likely than b also given their union. We also assume that the predictor's ranking is Archimedean in the following sense: if a database I renders eventuality a more likely than eventuality b, then for every other database J there is a sufficiently large number of replications of I that, when added to J, will make a more likely than b. That is, even if the predictor has reason to believe that b is more likely than a, a database that supports a over b can be replicated until it overwhelms the initial assessment. Finally, we need an assumption of diversity, stating that any list of four eventualities may be ranked, for some conceivable database, from top to bottom.

Together, these assumptions necessitate that prediction be made according to the rule suggested above. Moreover, we show that the similarity
evaluations underlying the prediction are almost unique (Theorem 1).

This result can be interpreted in several ways. From a descriptive viewpoint, one may argue that experts' predictions tend to be consistent as required by our axioms (of which the combination axiom is the most important), and that they can therefore be represented as aggregate similarity-based predictions. From a normative viewpoint, our result can be interpreted as suggesting aggregate similarity-based predictions as the only way to satisfy our consistency axioms. In both approaches, one may attempt to measure similarities using the likelihood rankings given various databases.
DERIVATION OF MAXIMUM LIKELIHOOD

Prediction is not restricted to single cases. Often one is asked to choose not only the most plausible outcome in a given instance, but the most plausible theory, or hypothesis. Hume's problem can be framed not only as whether the sun will rise tomorrow, but also as whether the theory that the sun always rises is a plausible theory to hold. Needless to say, cases, or observations, are a basis for selecting theories as well. How should we use cases in this problem?

We argue that our axioms are reasonable for this case as well. Indeed, the main axiom is that, if, based on each of two disjoint databases, we tend to prefer theory T to theory T′, we should have the same preference based on the union of the databases. Hence our rule is also a reasonable suggestion for ranking theories given data: every theory-case pair is ascribed a number, and the plausibility of a theory is measured by the summation of the numbers corresponding to it, over all cases (including repetitions of cases).

Suppose that the numbers assigned to theory-case pairs are negative. They can then be viewed as the logarithms of the conditional probabilities of cases given theories. Summing up these numbers over all cases corresponds to multiplying the conditional probabilities. In other words, the numerical function that measures the plausibility of a theory is simply the log-likelihood function. Thus our theorem can be viewed as an axiomatization of likelihood ranking. Given a qualitative "at least as plausible as" relation between theories, we derive numerical conditional probabilities for each theory and each case, together with the algorithm that ranks theories based on their likelihood functions, under the assumption that all cases were statistically independent. While the conditional probabilities we derive are unique only up to certain transformations, our result does provide a compelling reason to use likelihood rankings.
DERIVATION OF SUBJECTIVE PROBABILITIES

We also address the case in which the objects to be ranked have a logical or algebraic structure. That is, instead of eventualities, which can be viewed as states of the world, we assume that the predictor ranks events. The diversity axiom stated above is not very plausible for this case, since it requires that, for some memories, an event be ranked more plausible than a superset thereof. Hence we start by restricting the diversity axiom in a way that takes into account set calculus, and we then show that a similar result still holds (Theorem 6). That is, one can represent plausibility rankings of events by numerical functions, each of which is linear in case repetitions. This may be viewed as a derivation of not-necessarily-additive subjective probabilities, attempting to describe the way such probabilities are generated from past cases.

The algebraic structure of events allows us to impose an additional condition of event cancellation à la de Finetti. Specifically, it states that, for every memory, the ranking between two events is not altered when a common subevent is "deleted". The next major result (Theorem 7) is that this condition is equivalent to the condition that the similarity values, for each case, are additive with respect to the union of disjoint sets. It therefore describes the way that a predictor who is committed to this cancellation condition may form (additive) probabilistic beliefs over events given any possible memory.

We then proceed to test our model in the benchmark example of frequentism. That is, we assume that each case observed in the past can only be one of the possible eventualities in the problem at hand. Under this structural
assumption it is natural to state two additional assumptions on plausibility rankings, which are shown to be equivalent to frequentism, namely, to ranking events by their empirical frequencies (Theorem 10). While it is reassuring to know that frequentism is a special case of our model, we consider it a conceptually simple problem. The more interesting problems, as in Examples 3 and 4 above, are those in which each past case is pertinent to the present problem to a certain subjective degree. Our results show how one may form probabilistic beliefs based on partially relevant information. Conversely, they also show how qualitative "at least as plausible as" comparisons may be used to elicit the subjective similarity judgments, and when these can be assumed additive with respect to set union.
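The frequentist special case has a particularly transparent form under the representation. As a sketch (the die counts below are invented): if each past case is itself one of the eventualities and every case is equally relevant, taking the similarity to be an indicator makes the aggregate score of an eventuality equal to its empirical frequency count.

```python
# Frequentism as a special case of the linear rule: v(a, c) = 1 if a == c,
# else 0, so the score of a is exactly the number of times a was observed.

eventualities = [1, 2, 3, 4, 5, 6]                 # faces of a die
memory = {1: 3, 2: 5, 3: 2, 4: 7, 5: 1, 6: 2}      # hypothetical counts I(c)

def v(a, c):
    return 1.0 if a == c else 0.0

def score(a):
    # sum_c I(c) * v(a, c) collapses to the empirical frequency count of a.
    return sum(memory[c] * v(a, c) for c in memory)

ranking = sorted(eventualities, key=score, reverse=True)  # most frequent first
```

Ranking by `score` therefore coincides with ranking by empirical frequencies, which is the benchmark behavior expected in Example 1.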
DECISION THEORIES

In the above we focus on the formation of probabilities as a general-purpose tool for the quantification of plausibility judgments. It is natural, however, to wonder how our derivation of probabilities relates to the classical decision-theoretic derivations thereof, which are typically coupled with expected utility maximization. To this end, we first provide a new characterization of expected utility maximization under risk (Theorems 12-13), and consider a process in which the decision maker first forms probabilistic beliefs and then uses them to reduce a decision problem under uncertainty to a decision problem under risk. We offer this as a possible cognitive account of how people might come to behave as expected utility maximizers.

One may also re-interpret our basic result (Theorem 1) as dealing with the ranking of decisions, or acts, rather than of eventualities. As such, this result offers a generalized case-based decision theory. It states that for every case and every act there is a number that measures the degree of support that the case lends to the act, and that the decision maker chooses an act with the highest aggregate support. This theory is purely behaviorist in that it makes no reference to cognitive constructs such as "probability", "utility", or "similarity". It can be specified to include such constructs, where expected
utility theory and case-based decision theory (Gilboa and Schmeidler (1995)) are special cases thereof.

The rest of this paper is organized as follows. Section 2 axiomatizes the prediction rule. Section 3 deals with maximum likelihood rankings of theories. We devote Section 4 to a discussion of the axioms, in which we present several classes of examples in which they may fail to be reasonable. Section 5 extends the theory to the ranking of events, and Section 6 specializes that ranking to the case of frequentist rankings. Section 7 shows that minor variations on our basic result give rise to expected utility maximization under risk. In Section 8 we unify the discussion of the previous sections to derive a cognitive account of expected utility maximization, and to discuss the generalized case-based decision theory. Section 9 concludes with an alternative interpretation of the basic result, deriving scoring rules in voting theory. Finally, an appendix contains all proofs.
2 Ranking Eventualities

2.1 The framework
The primitives of our model consist of two non-empty sets X and M and a structure based on them. The set M is assumed to be finite. In the informal discussion one may think of X as being finite too. Elements of X are termed eventualities. Their ranking by a binary relation, "at least as likely as", is the task of the predictor. To carry out this task, the predictor is equipped with knowledge of cases. The latter are the elements of M.

While evaluating likelihoods, it is insightful not only to know what has happened, but also to take into account what could have happened. The predictor is therefore assumed to have a well-defined "at least as likely as" relation on X for many other collections of cases in addition to M itself. We represent these conceivable collections of cases through replications or repetitions of the original cases in M. In other words, we tax the predictor's
imagination to some extent by assuming that observations occurred several times instead of once, but we do not force her to consider cases that have never happened.

Formally, denote $\mathbb{J} = \mathbb{Z}_+^M = \{I \mid I : M \to \mathbb{Z}_+\}$, where $\mathbb{Z}_+$ stands for the non-negative integers. We assume that for every $I \in \mathbb{J}$ the predictor has a binary relation "at least as likely as" on X denoted by $\succsim_I$ (i.e., $\succsim_I \subset X \times X$). In other words, we are formalizing here a rule, as opposed to an act, of prediction. The rule generates predictions not only for the available set of cases, but also for hypothetical ones. Some of the hypothetical collections of cases, i.e., vectors in $\mathbb{J}$, may become actual when the predictor acquires additional information. (On this see also Subsection 2.4.) In conclusion, our structural assumption (Axiom 0) is the existence of a binary relation $\succsim_I$ on X for all $I \in \mathbb{J}$.

A note on nomenclature: the main result of this section is interpreted as a representation of a prediction rule. Accordingly, we refer to a "predictor", who may be a person, an organization, or a machine. However, the result may and will be interpreted in other ways as well. Instead of ranking eventualities one may rank decisions, acts, or, to use a more neutral term, alternatives. Cases, the elements of M, may also be called observations or facts. A vector of repetitions I in $\mathbb{J}$ represents the predictor's knowledge and will be referred to as a memory or database.
2.2 The axioms
We will use the four axioms stated below. In their formalization, let $\succ_I$ and $\sim_I$ denote the asymmetric and symmetric parts of $\succsim_I$, as usual. $\succsim_I$ is complete if $x \succsim_I y$ or $y \succsim_I x$ for all $x, y \in X$. Finally, algebraic operations on $\mathbb{J}$ are performed pointwise.

A1 Order: For every $I \in \mathbb{J}$, $\succsim_I$ is complete and transitive on X.
A2 Combination: For every $I, J \in \mathbb{J}$ and every $a, b \in X$, if $a \succsim_I b$ ($a \succ_I b$) and $a \succsim_J b$, then $a \succsim_{I+J} b$ ($a \succ_{I+J} b$).
A3 Archimedean Axiom: For every $I, J \in \mathbb{J}$ and every $a, b \in X$, if $a \succ_I b$, then there exists $l \in \mathbb{N}$ such that for all $k \ge l$, $a \succ_{kI+J} b$.

Axiom 1 simply requires that, given any conceivable memory, the predictor's likelihood relation over eventualities is a weak order. Axiom 2 states that if eventuality a is more plausible than eventuality b given each of two disjoint memories, a should also be more plausible than b given the combination of these memories. In our set-up, combination (or concatenation) of memories takes the form of adding the numbers of repetitions of each case in the two memories. Axiom 3 is a continuity condition. It states that if, given the memory I, the predictor believes that eventuality a is strictly more plausible than b, then, no matter what her ranking is for another memory J, there is a number of repetitions of I that is large enough to overwhelm the ranking induced by J.

Finally, we need a diversity axiom that is not necessary for the functional form we would like to derive. While the theorem we present is an equivalence theorem, it characterizes a more restricted class of plausibility rankings than those discussed in the introduction. Specifically, we require that for any four eventualities, there is a memory that distinguishes among all four of them.

A4 Diversity: For every list (a, b, c, d) of distinct elements of X there exists $I \in \mathbb{J}$ such that $a \succ_I b \succ_I c \succ_I d$. If $|X| < 4$, then for any strict ordering of the elements of X there exists $I \in \mathbb{J}$ such that $\succ_I$ is that ordering.
2.3 The basic result
The key result of this paper can now be formulated.

Theorem 1: Let there be given X, M, and $\{\succsim_I\}_{I \in \mathbb{J}}$ as above. Then the following two statements are equivalent if $|X| \ge 4$:

(i) $\{\succsim_I\}_{I \in \mathbb{J}}$ satisfy A1-A4;

(ii) There is a matrix $v : X \times M \to \mathbb{R}$ such that:

(∗∗)  for every $I \in \mathbb{J}$ and every $a, b \in X$, $a \succsim_I b$ iff $\sum_{c \in M} I(c)v(a,c) \ge \sum_{c \in M} I(c)v(b,c)$,
and for every list (a, b, c, d) of distinct elements of X, the convex hull of the differences of the row-vectors $(v(a,\cdot) - v(b,\cdot))$, $(v(b,\cdot) - v(c,\cdot))$, and $(v(c,\cdot) - v(d,\cdot))$ does not intersect $\mathbb{R}_-^M$.

Furthermore, in this case the matrix v is unique in the following sense: v and w both satisfy (∗∗) iff there are a scalar $\lambda > 0$ and a matrix $u : X \times M \to \mathbb{R}$ with identical rows (i.e., with constant columns) such that $w = \lambda v + u$. Finally, in the case $|X| < 4$, the numerical representation result (as in (∗∗)) holds, and uniqueness as above is guaranteed.

The main point of the theorem is that the infinite family of rankings has a linear representation via a real matrix of order $|X| \times |M|$. For $(a, c) \in X \times M$ the number $v(a,c)$ is interpreted as the amount of support that case c lends to eventuality a. It will be referred to as similarity. Note, however, that it may be negative. Specifically, the uniqueness part of the theorem states that the similarities may be rescaled by changing the units by a multiplicative positive coefficient and adjusting the origin by an additive constant, possibly different for each case.

Given any real matrix of order $|X| \times |M|$, one can define for every $I \in \mathbb{J}$ a weak order on X through (∗∗). It is easy to see that A2 and A3 hold and that A4 may not hold. For example, A4 will be violated if a row of the matrix dominates another row. Hence a counter-example proving the following proposition is called for.

Proposition 2: Axioms A1, A2, and A3 do not imply the existence of a matrix v satisfying (∗∗). In other words, A1, A2, and A3 are necessary but insufficient conditions for the existence of a matrix v satisfying (∗∗).
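The representation (∗∗) is simply a matrix-vector product, which is easy to illustrate numerically. The entries of the matrix below are invented purely for illustration:

```python
# A numeric sketch of the representation (**): a similarity matrix v induces,
# for every memory vector I, a ranking of eventualities by the product v I.

X = ["a", "b", "c"]                    # eventualities (rows of v)
v = {"a": [2.0, -1.0],                 # v[x][c]: support case c lends to x
     "b": [0.0,  1.0],
     "c": [1.0,  0.0]}

def ranking(I):
    """Order eventualities by sum_c I(c) * v(x, c), largest score first."""
    scores = {x: sum(i * s for i, s in zip(I, v[x])) for x in X}
    return sorted(X, key=lambda x: -scores[x]), scores

# Memory (1, 0): a single occurrence of the first case makes a most plausible.
order1, _ = ranking([1, 0])
# Memory (0, 3): replications of the second case reverse this, in the spirit
# of the Archimedean axiom A3.
order2, _ = ranking([0, 3])
```

By the uniqueness part of the theorem, replacing v with $\lambda v$ ($\lambda > 0$) plus a constant per column shifts every eventuality's score equally and therefore leaves every induced ranking unchanged.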
2.4 Learning
There are several aspects of the primitives of our model that require clarification, elaboration, and extension. As pointed out in Subsection 2.1, the prediction rule is constructed in anticipation of incorporating new information. Let us assume that the present information is $I \in \mathbb{Z}_+^M$ and the ranking of eventualities is $\succsim_I$. The predictor acquires a new database, say J, and her updated ranking is $\succsim_{I+J}$. Suppose now that $J \notin \mathbb{Z}_+^M$, but $J \in \mathbb{Z}_+^{M'}$ with $c' \in M' \setminus M$ and $J(c') > 0$. The combination axiom and the representation theorem do not apply and updating is not possible, because, for a different memory $M'$, the theorem would yield a matrix $v'$ that may be completely unrelated to the matrix v of the memory M. This would imply that similarity judgments between two cases may depend not only on their intrinsic features, but also on other cases in memory. Should one want to rule out this dependence and allow learning, one should require an additional axiom. Before we introduce it, a piece of notation will be useful: for two memories M and $M'$ satisfying $M \subset M'$, and for each $I \in \mathbb{Z}_+^M$, let $I' \in \mathbb{Z}_+^{M'}$ be the extension of I to $M'$ defined by $I'(c) = I(c)$ for $c \in M$ and $I'(c) = 0$ for $c \in M' \setminus M$. We can now state the condition guaranteeing that the similarity function is independent of memory:

A5 Case Independence: If $M \subset M'$, then $\succsim_I = \succsim_{I'}$ for all $I \in \mathbb{Z}_+^M$.

We now state the sought-for result. Its proof is quite simple and hence omitted.

Proposition 3: Let there be given a set of cases C, a set of alternatives X, and a family $\mathcal{M}$ of finite subsets of C. Assume that for every $M \in \mathcal{M}$, $\{\succsim_I\}_{I \in \mathbb{Z}_+^M}$ satisfy A1-A4, that $|X| \ge 4$, and that $\mathcal{M}$ is closed under union. Then the following two statements are equivalent: (i) A5 holds on $\mathcal{M}$; (ii) For every $c \in C$ and every $a \in X$ there exists a number $v(a,c)$ such that (∗∗) of Theorem 1 (ii) holds for every $M \in \mathcal{M}$.
Axiom 5 and Proposition 3 enable us to resolve an additional modeling difficulty within our framework. From the description in Subsection 2.1 it seems that the initial information of the predictor is M or, equivalently, $1_M \in \mathbb{Z}_+^M$.² Assume, for instance, that the observed cases are 5 flips of a coin, say HTHHT. Are these five separate cases, with $M = \{c_i \mid i = 1, \dots, 5\}$ and $I = (1,1,1,1,1)$? Or is the correct model one where $M' = \{H, T\}$ and $I = (3,2) \in \mathbb{Z}_+^{M'}$? Or should we perhaps opt for a third alternative, where the predictor's knowledge is $J = (3,2,0,0,0)$ over a memory extending $M'$ by cases that did not occur? It would be nice to know that these modeling choices do not affect the result.

First we note that, indeed, under Axiom 5 and the proposition, the rankings of X for $(3,2)$ and for $(3,2,0,0,0)$ should be identical. Second, if the predictor believes that $c_1, c_3, c_4 \in M$ are the "same" case, and so are $c_2, c_5 \in M$, we should also expect the rankings for $(1,1,1,1,1)$ and for $(3,2,0,0,0)$ to be identical. Moreover, if this is the case also whenever we "replace" occurrences of $c_1$ by occurrences of, say, $c_3$, we may take this as a definition of identicality of cases. Formally, for two cases $c_1, c_2 \in M$, define $c_1 \simeq c_2$ if the following holds: for every $I, J \in \mathbb{J}$ such that $I(c) = J(c)$ for all $c \ne c_1, c_2$, and $I(c_1) + I(c_2) = J(c_1) + J(c_2)$, we have $\succsim_I = \succsim_J$. In conclusion we state the obvious:

Proposition 4: (i) $\simeq$ is an equivalence relation on M. (ii) If (∗∗) holds, then $c_1 \simeq c_2$ iff there exists $\beta \in \mathbb{R}$ such that $v(a, c_1) = v(a, c_2) + \beta$ for all $a \in X$.

It follows that one may start out with a large set of distinct cases, and allow the predictor's subjective rankings to reveal possible equivalence relations between them. Then, using Proposition 4, it is possible to have an equivalent representation of the knowledge, based on the smallest set of different cases. Thus, for the minimal representing matrix v, not only are all
² $1_E$ stands for the indicator vector of a set E in the appropriate space. In our model we identify M with $1_M \in \mathbb{Z}_+^M$.
columns different, but no column is obtained from another by the addition of a constant.
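The criterion of Proposition 4(ii) is directly checkable on a representing matrix. A small sketch (the matrix entries are invented for illustration):

```python
# Proposition 4(ii) as a column test: cases c1 and c2 are "identical"
# (c1 ≃ c2) exactly when the corresponding columns of v differ by the same
# constant beta in every row.

v = {"a": [1.0, 3.0, 0.5],   # columns correspond to cases c1, c2, c3
     "b": [2.0, 4.0, 1.0],
     "c": [0.0, 2.0, 2.0]}

def equivalent(j, k, tol=1e-9):
    """True iff columns j and k of v differ by a constant across all rows."""
    diffs = [row[j] - row[k] for row in v.values()]
    return max(diffs) - min(diffs) < tol

# Columns c1 and c2 differ by beta = -2 in every row, so the two cases are
# interchangeable for prediction; column c3 represents a genuinely new case.
```

Merging columns that pass this test yields the minimal representing matrix described above, in which no column is another column plus a constant.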
3 Ranking Theories: Maximum Likelihood
The main interpretation of the results in Section 2 deals with ranking specific eventualities. For instance, if one is to predict the result of the conflict in Kosovo, one is modeled as summing up similarity values of the situation in Kosovo to past cases such as the Vietnam War, the Gulf War, and the like. However, it is often the case that one would like to think in more general terms than particular cases. For various reasons one may wish to formulate beliefs in the form of general rules, or theories, and to rank these. Our international relations expert may be asked, for instance, not only what will happen in Kosovo, but also whether conflicts of this nature generally get resolved in peaceful ways. For the purposes of knowledge organization, as well as for argumentative purposes, one may find it useful to be able to conduct the discussion in terms of rules. Scientific activity, for instance, is generally expected to be conducted at a certain level of generality. How, then, should we rank theories based on past observations?

Assume that the set X in the model of Section 2 is, indeed, a set of possible theories, or laws. Observe that the axioms we formulated apply to this case as well. In particular, our main requirements are that theories be ranked by a weak order for every memory, and that, if theory a is more plausible than theory b given each of two memories, a should also be more plausible than b given the combination of these memories. Assume, therefore, that Theorem 1 holds. Choose a representation v where $v(a,c) < 0$ for every theory a and case c. Define $p(c|a) = \exp(v(a,c))$, so that $\log(p(c|a)) = v(a,c)$. Our result states that, for every two theories a, b:

(∗∗)  $a \succsim_I b$  iff  $\sum_{c \in M} I(c)v(a,c) \ge \sum_{c \in M} I(c)v(b,c)$,
which is equivalent to

$\exp\left(\sum_{c \in M} I(c)v(a,c)\right) \ge \exp\left(\sum_{c \in M} I(c)v(b,c)\right)$

or

$\prod_{c \in M} p(c|a)^{I(c)} \ge \prod_{c \in M} p(c|b)^{I(c)}$.
In other words, should a predictor rank theories in accordance with A1-A4, there exist conditional probabilities $p(c|a)$, for every case c and theory a, such that the predictor ranks theories as if by their likelihood functions, under the implicit assumption that the cases were stochastically independent.

On the one hand, this result can be viewed as a normative justification of the likelihood rule: any method of ranking theories that is not equivalent to ranking by likelihood (for some conditional probabilities $p(c|a)$) has to violate one of our axioms. On the other hand, our result can be descriptively interpreted, saying that likelihood rankings of theories are rather prevalent. One need not consciously assign conditional probabilities $p(c|a)$ for every case c given every theory a, and one need not know probability calculus, in order to generate predictions in accordance with the likelihood criterion. Rather, whenever one satisfies our axioms, one may be ascribed conditional probabilities $p(c|a)$ such that one's predictions are in accordance with the resulting likelihood functions. Thus, relatively mild consistency requirements imply that one predicts as if by likelihood functions. Finally, our result may be used to elicit the subjective conditional probabilities $p(c|a)$ of a predictor, given her qualitative rankings of theories.

However, our uniqueness result is somewhat limited. In particular, for every case c one may choose a positive constant $\beta_c$ and multiply $p(c|a)$ by $\beta_c$ for all theories a, resulting in the same likelihood rankings. Similarly, one may choose a positive number $\alpha$ and raise all probabilities $\{p(c|a)\}_{c,a}$ to the power of $\alpha$, again without changing the observed ranking of theories given possible
memories. Thus there will generally be more than one set of conditional probabilities $\{p(c|a)\}_{c,a}$ that is consistent with $\{\succsim_I\}_{I \in \mathbb{J}}$.

The likelihood function relies on independence across cases. Conceptually, stochastic independence follows from two assumptions in our model. First, we have defined $\{\succsim_I\}_{I \in \mathbb{J}}$ where each I counts case repetitions. This implicitly assumes that only the number of repetitions of cases, and not their order, matters. This structural assumption is reminiscent of de Finetti's exchangeability condition (though the latter is defined in a more elaborate probabilistic model). Second, our combination axiom also has a flavor of independence. In particular, it rules out situations in which past occurrences of a case make future occurrences of the same case less (or more) likely.
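Reading $v(a,c)$ as $\log p(c|a)$ makes the plausibility score a log-likelihood, so the ranking rule of this section can be sketched directly. The two theories and their conditional probabilities below are invented for illustration:

```python
import math

# Ranking theories by sum_c I(c) log p(c|a) is equivalent to ranking by the
# likelihood prod_c p(c|a)^I(c), under independence across cases.

theories = {
    "fair":   {"H": 0.5, "T": 0.5},   # hypothetical p(c|a)
    "biased": {"H": 0.8, "T": 0.2},
}
memory = {"H": 7, "T": 3}             # observed case counts I(c)

def log_likelihood(a):
    # v(a, c) = log p(c|a) < 0, so the score is linear in the counts I(c).
    return sum(memory[c] * math.log(theories[a][c]) for c in memory)

best = max(theories, key=log_likelihood)  # the maximum-likelihood theory
```

Consistent with the limited uniqueness noted above, multiplying each $p(c|\cdot)$ by a per-case constant $\beta_c$ shifts every theory's score by the same amount, and raising all probabilities to a power $\alpha > 0$ scales all scores equally; neither operation changes the ranking.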
4  Discussion of the Axioms
While our axioms are rather plausible, there are applications in which they do not appear compelling. We discuss here several examples, trying to delineate the scope of applicability of the axioms, and to identify certain classes of situations in which the axioms may not apply. In these situations one should take the linear aggregation rule with a grain of salt, for descriptive and normative purposes alike. In the following discussion we do not dwell on the first axiom, namely, that plausibility rankings are weak orders. This axiom and its limitations have been extensively discussed in decision theory, and there seem to be no special arguments for or against it in our specific context. We also have little to add to the discussion of the diversity axiom. While it does not appear to pose conceptual difficulties, there are no fundamental reasons to insist on its plausibility either. One may well be interested in other assumptions that would allow a similar numerical representation. We therefore focus on the combination and continuity assumptions.

Mis-specified Cases  One class of examples in which our assumptions
appear implausible has to do with mis-specified cases. For instance, consider a cat, say Lucifer, who every so often dies and then may or may not resurrect. Suppose that many other cats have been observed to resurrect exactly nine times throughout history. If Lucifer had died and resurrected four times, and now died for the fifth time, we'd expect him to resurrect again. But if we double the number of cases, implying that we are now observing the ninth death, we would not expect Lucifer to be with us again. Thus, one may argue, the combination axiom does not seem to be very compelling. But we may also draw a different conclusion from this example, namely, that one should define "cases" in an informative enough way to guarantee their independence. For instance, if Alice and Bob play tennis, and Alice has succeeded in surprising Bob once with a clever shot, it would be silly of her to assume that repeating the same shot will surprise him again. While Bob may fail to learn from his bitter experience, Alice should obviously conceptualize the situation as "playing after the secret weapon has been used", and this should differ from the situation "playing before the secret weapon has been used". Similarly, a veteran cat who has died eight times is not very similar to a novice cat who experiences death for the first time. If one does make sure that evidently relevant information is incorporated into the description of the cases, the combination axiom appears quite reasonable. In this case one is led to ask, however, when is a description informative enough? After all, one can imagine a level of detail that would make every case unique and would render the notion of repetition vacuous. For instance, one may index every case by the exact time at which one became aware of it, and then find that no case can ever repeat itself in principle. Our notion of repetition implicitly assumes that such subjective factors are deemed irrelevant for prediction.
We do not find this assumption very restrictive, especially because any form of inductive or probabilistic reasoning relies on subjective assessments of similarity, according to which one may define such concepts as "repetition" or "identical conditions". Moreover, for normative
purposes it seems rather compelling that one be able to coherently deal with repetitions, ignoring subjective factors that artificially differentiate them. To sum up, the degree to which our axioms are reasonable invariably depends on the way cases are defined. For our model to make sense, the description of cases should be detailed enough to justify the combination axiom, yet it should be general enough to allow repetitions.

Mis-specified Theories  When applying the combination axiom to general theories, rather than to concrete eventualities, one has to verify that the theories are also well-specified. For instance, suppose that one wishes to determine whether a coin is biased. A memory with 1,000 repetitions of "Head", as well as a memory with 1,000 repetitions of "Tail", both suggest that the coin is indeed biased, while their union suggests that it is not.3 As mentioned above, our combination axiom makes an implicit assumption of stochastic independence. Under this assumption, it is highly unlikely to observe such memories. That is, adopting the combination axiom entails an implicit assumption that such anomalies will not occur. But this example also shows that ambiguity in the formulation of theories may make us reject the combination axiom. Specifically, the theory that the coin is "biased", without specifying in what way it is biased, may be viewed as a mis-specified theory.

Theories about Patterns  A related class of examples deals with concepts that describe, or are defined by, patterns, sequences, or sets of cases. Assume that a single case consists of ten tosses of a coin. A highly complex sequence of tosses may lend support to the hypothesis that the coin generates random sequences. But many repetitions of the very same sequence would undermine this hypothesis. Observe that "the coin generates random sequences" is a statement about sequences of cases.
Similarly, statements such as "The weather always surprises" or "History repeats itself" are about sequences of cases, and are therefore likely to generate violations of the combination axiom.

3 This example is partly inspired by an example of Daniel Lehman.

Strategic Choice of Observations  From a normative point of view, our theory may not be very compelling when the predictor has some control over the set of observations. For instance, assume that a scientist has to choose between theory a and theory b. Conducting experiment 1 typically results in a case m1, which appears more plausible if a is true. Conducting experiment 2 leads to a case m2, which lends greater support to theory b. By deciding which experiment to run, the scientist can choose her beliefs. While choosing one's own beliefs is typically not considered rational, attempting to change someone else's beliefs by selective reminding of cases is rather prevalent, as in legal argumentation. Indeed, our theory may explain why such techniques prove effective.

Overwhelming Evidence  There are situations in which a single case may outweigh any number of repetitions of other cases, in contradiction to the Archimedean axiom. For instance, a physician may find a single observation, taken from the patient she is currently treating, more relevant than any number of observations taken from other patients.4 In the context of ranking theories, it is possible that a single case m constitutes a direct refutation of a theory a. If another theory b was not refuted by any case in memory, a single occurrence of case m will render theory a less plausible than theory b regardless of the number of occurrences of other cases, even if these lend more support to a than to b.5 In such a case, one would like to assign conditional probability of zero to case m given theory a, or, equivalently, to set v(a, c) to −∞. More generally, one may extend Theorem 1 to provide representations by non-standard numbers, allowing several levels of "impossibility" as well.

4 Indeed, the nearest neighbor approach to classification problems violates the Archimedean axiom.
5 This example is due to Peyton Young.

Small Numbers  The combination and continuity axioms may also be problematic when memories with very few repetitions are involved. For instance, consider the example above where memory I contains only one case m1 and memory J contains many repetitions of both m1 and m2. Suppose that J renders theory b more plausible than theory a. The Archimedean axiom requires that adding a large enough number of repetitions of m1 to J result in finding a more plausible than b. But this may be counter-intuitive if case m2 teaches us something that is qualitatively new. For example, theory a might be that all objects are drawn to Earth, while theory b, that they are drawn to all other objects to some degree defined by their masses and the distance between them. Having observed only objects that fall down to earth (m1), a may be more plausible than b, if only because of its simplicity. But observing several cases of planets being drawn to each other may render b more plausible. This ranking will not change by observing even more objects that fall down to earth when dropped. Observe that the problem here may not be the continuity axiom per se. Rather, one may argue that our model is inadequate to analyze this example since it implicitly assumes that the predictor can always conceive of all theories, whereas, given only memory I, the predictor is unlikely to even conceive of theory b. Alternatively, one may interpret the set M as the cases the predictor conceives of, and then reject A5 on the grounds that similarity judgments may change when we enlarge the set of cases one is aware of. Indeed, observing the case m2 may change the way we evaluate the relevance of case m1, as well as the way that we represent and conceptualize it. Another example that hinges on small numbers is the following.6 Case m1 is a toss of a coin that resulted in the outcome Head. Case m2 is a toss of the same coin resulting in Tail. Let I be a memory with a single case m1. You are asked to rank two predictions: a, that all the next 1,000 tosses will result in Head, versus b, that there will be at least one Tail in the next 1,000 tosses. Given memory I, one may find a less likely than b.
Given a billion repetitions of m1, however, a becomes more likely than b, contradicting the combination axiom. This example raises a question regarding the cases one has in memory. One may argue that if, indeed, memory I represents all that the predictor has ever observed, a would have been considered more likely than b even under I. The fact that we do not expect 1,000 tosses to come up Head is probably due to the fact that we have observed many other coins, studied probability, and so forth. That is, the memory that makes a less plausible than b contains more than just a single occurrence of m1. In this case, this example may be re-formulated as a counter-example to the continuity, rather than to the combination axiom.

6 This example is due to Ariel Rubinstein.

Second-Order Induction  An important class of examples in which we should expect the combination axiom to be violated, for descriptive and normative purposes alike, involves learning of the similarity function. For instance, assume that one database contains but one case, in which Mary chose restaurant a over b.7 One is asked to predict what John's decision would be. Having no other information, one is likely to assume some similarity of tastes between John and Mary and to find it more plausible that John would prefer a to b as well. Next assume that in a second database there are no observed choices (by anyone) between a and b. Hence, based on this database alone, it would appear equally likely that John would prefer a to b. Assume further that this database does contain many choices between other pairs of restaurants, and it turns out that John and Mary consistently choose different restaurants. When combining the two databases, it makes sense to predict that John would choose b over a. This is an instance in which the similarity function is learned from cases. Linear aggregation of cases by fixed weights embodies learning by a similarity function. But it does not describe how this function itself is learned. In Gilboa and Schmeidler (1999) we call this process "second-order induction" and argue that the linear formula should only be taken as a very rough approximation when such a process is involved.

7 This is a variant of an example by Sujoy Mukerji.
5  Ranking Events

5.1  On inclusion between events
The predictions discussed in Section 2 deal with abstract eventualities, lacking any logical or algebraic structure. It is natural to ask how similarity-based ranking of predictions relates to basic logical or set operations. In an attempt to address this question, we focus here on the case in which the alternatives to be ranked are events. Let Ω be a state space and let Σ be an algebra of events on Ω that contains all singletons. One can apply Theorem 1 to events, substituting Ω for X. Obviously, the diversity axiom, A4, applied now to all lists of four events, doesn't make any sense. One cannot expect an event A to be strictly more plausible than an event B if A ⊂ B. Moreover, if one ranks plausibility according to a probability measure and if 1_A + 1_B ≤ 1_C + 1_D or 1_A + 1_C ≤ 1_B + 1_D (where 1_E denotes the indicator function of E ∈ Σ), one cannot expect any memory to induce the ranking A ≻_I B ≻_I C ≻_I D. The proposition that follows shows that the above exceptions are the only ones.

Proposition 5  Suppose that (A, B, C, D) are four events in a measurable space (Ω, Σ). Then there exists a probability measure P on Σ such that P(A) > P(B) > P(C) > P(D) iff
(i) no event in the list (A, B, C, D) is a subset of a follower in the list; and
(ii) neither 1_A + 1_B ≤ 1_C + 1_D nor 1_A + 1_C ≤ 1_B + 1_D holds.

In view of the proposition we have to weaken the diversity axiom by restricting it to lists of four events satisfying conditions (i) and (ii). The technique of our proof suggests a weakening of the rest of the axioms as well. In order to apply Theorem 1 to events, we need it to be the case that, to any two comparable events, say E and F, another two events, say G and H,
can be added, such that the original diversity axiom, A4, holds on the set of four events, {E, F, G, H}. Since we have to derive A4 from the restricted diversity axiom, any ordering of {E, F, G, H} has to satisfy condition (i). It is easier to continue the discussion using the following term: two events E, F ∈ Σ are called non-included if F∖E ≠ ∅ ≠ E∖F. Thus the discussion above implies that it suffices for the relation "at least as likely as" to hold only between non-included events in Σ (for all I ∈ 𝕁). Similarly, pairs of events {ω}, Ω∖{ω} may also be excluded from comparability, because for any two events G and H several permutations of ({ω}, Ω∖{ω}, G, H) violate condition (i) above. (Any event G either includes {ω} or is included in Ω∖{ω}.) We therefore focus on Σ′ = {A ∈ Σ | A ≠ ∅ and |A^c| > 1}. For every I ∈ 𝕁 a binary relation ≽_I ⊆ {(E, F) ∈ Σ′ × Σ′ | E and F are non-included} is assumed. We start by re-stating the first three axioms for events.

A1* Order: For every I ∈ 𝕁, for every pair of non-included events A, B ∈ Σ′, A ≽_I B or B ≽_I A. Furthermore, if A, B, C ∈ Σ′ are pairwise non-included, then A ≽_I B and B ≽_I C imply A ≽_I C.

A2* Combination: For every I, J ∈ 𝕁 and every pair of non-included events A, B ∈ Σ′, if A ≽_I B (A ≻_I B) and A ≽_J B, then A ≽_{I+J} B (A ≻_{I+J} B).

A3* Archimedean Axiom: For every I, J ∈ 𝕁 and every pair of non-included events A, B ∈ Σ′, if A ≻_I B, then there exists l ∈ ℕ such that A ≻_{lI+J} B.

For the introduction of the fourth axiom we need additional terminology: (A, B, C, D) ∈ (Σ′)⁴ is called a list of orderly differentiated events if the events in the list are pairwise non-included, and neither 1_A + 1_B ≤ 1_C + 1_D nor 1_A + 1_C ≤ 1_B + 1_D. Note that "pairwise non-included" is more restrictive than "no event in the list is a subset of a follower in the list". However, in view of our structural assumption, non-inclusiveness is necessary for comparability. We can now state:
A4* Restricted Diversity: For every list of orderly differentiated events (A, B, C, D), there exists I ∈ 𝕁 such that A ≻_I B ≻_I C ≻_I D.
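To make the two conditions concrete, the following sketch (with hypothetical events over a five-state space) tests whether a list of four events is orderly differentiated in the sense just defined: pairwise non-inclusion plus the failure of both indicator inequalities.

```python
OMEGA = {1, 2, 3, 4, 5}

def indicator_leq(A, B, C, D):
    # Does 1_A + 1_B <= 1_C + 1_D hold pointwise on OMEGA?
    return all((w in A) + (w in B) <= (w in C) + (w in D) for w in OMEGA)

def orderly_differentiated(events):
    # Pairwise non-included: no event contains another.
    pairwise_non_included = all(
        not (E <= F or F <= E)
        for i, E in enumerate(events) for F in events[i + 1:])
    A, B, C, D = events
    return (pairwise_non_included
            and not indicator_leq(A, B, C, D)
            and not indicator_leq(A, C, B, D))

# {1,2}, {2,3}, {3,4}, {4,5}: pairwise non-included, and neither
# indicator inequality holds, so the list qualifies.
assert orderly_differentiated(({1, 2}, {2, 3}, {3, 4}, {4, 5}))
# Making the second event a superset of the first disqualifies the list.
assert not orderly_differentiated(({1, 2}, {1, 2, 3}, {3, 4}, {4, 5}))
```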
5.2  Main results
We can now state:

Theorem 6  Given {≽_I}_{I∈𝕁} as above, the following two statements are equivalent if |Ω| ≥ 5:
(i) {≽_I}_{I∈𝕁} satisfy A1*–A4*;
(ii) There is a matrix v : Σ′ × M → ℝ such that:

(∗∗)  if I ∈ 𝕁 and A, B ∈ Σ′ are non-included, then A ≽_I B iff Σ_{c∈M} I(c)v(A, c) ≥ Σ_{c∈M} I(c)v(B, c),
and for every list of orderly differentiated events (A, B, C, D), the convex hull of differences of the row-vectors (v(A, ·) − v(B, ·)), (v(B, ·) − v(C, ·)), and (v(C, ·) − v(D, ·)) does not intersect ℝ^M_−.

Furthermore, in this case the matrix v is unique in the following sense: v and w both satisfy (∗∗) iff there are a scalar λ > 0 and a matrix u : Σ′ × M → ℝ with identical rows (i.e., with constant columns) such that w = λv + u.

This result states that a method that ranks events by their likelihood, given any possible repetition of known cases, has to be equivalent to a numerical ranking where the number attached to each event is a linear function of the numbers of case repetitions. One naturally wonders what it would take to make these numbers probabilities. That is, when is there a probability measure μ_c for each case c, such that memory I induces the same ranking of events as the measure Σ_{c∈M} I(c)μ_c? The next results provide some partial answers to this question.
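A toy numerical instance of the representation (∗∗): fix a matrix v over three events and two cases (numbers invented purely for illustration) and rank events by the memory-weighted sums Σ_c I(c)v(·, c). Linearity in the repetition counts is immediate from the formula.

```python
events = ["A", "B", "C"]
# Hypothetical v(event, case) values; the two columns are cases c1, c2.
v = {"A": (1.0, 0.2), "B": (0.4, 0.8), "C": (0.1, 0.3)}

def rank(I):
    # Score each event by sum_c I(c) * v(event, c); linear in the counts.
    score = lambda e: sum(i * w for i, w in zip(I, v[e]))
    return sorted(events, key=score, reverse=True)

print(rank((5, 1)))   # many c1's: ['A', 'B', 'C']
print(rank((1, 5)))   # many c2's: ['B', 'A', 'C']
```

Doubling every count in I doubles every score, so the induced ranking depends only on the direction of the memory vector, as the combination axiom requires.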
Obviously, a necessary condition for a representation by additive measures is de Finetti's cancellation axiom (restricted to non-included events):

A6 Cancellation: For every I ∈ 𝕁, and for every three pairwise non-included events A, B, C such that (A ∪ B) ∩ C = ∅, we have A ≽_I B ⟺ A ∪ C ≽_I B ∪ C.

Our result is that, if we impose A6 on top of the conditions of the previous theorem, the resulting matrix v can be normalized so that it is additive in events. Namely, the rows of v satisfy v(A ∪ B, ·) = v(A, ·) + v(B, ·) whenever A ∩ B = ∅. Moreover, in this case we obtain uniqueness of the representation up to multiplication by a positive constant. Formally:

Theorem 7  Given {≽_I}_{I∈𝕁} as above, the following two statements are equivalent if |Ω| ≥ 5:
(i) {≽_I}_{I∈𝕁} satisfy A1*–A4* and A6;
(ii) There are finite, signed, and finitely additive measures {μ_c}_{c∈M} on Σ′ such that:

(∗∗)  if I ∈ 𝕁 and A, B ∈ Σ′ are non-included, then A ≽_I B iff Σ_{c∈M} I(c)μ_c(A) ≥ Σ_{c∈M} I(c)μ_c(B),
and for every list of orderly differentiated events (A, B, C, D), the convex hull of differences of the row-vectors (μ_c(A) − μ_c(B))_{c∈M}, (μ_c(B) − μ_c(C))_{c∈M}, and (μ_c(C) − μ_c(D))_{c∈M} does not intersect ℝ^M_−.

Furthermore, in this case the measures {μ_c}_{c∈M} are unique up to multiplication by a positive number.

The condition |Ω| ≥ 5 in the last two theorems is justified by the following:

Proposition 8  Theorem 7 may not hold if Ω contains less than five states.
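The role of A6 is transparent once the representation is additive: for an additive μ and C disjoint from A ∪ B, μ(A ∪ C) − μ(B ∪ C) = μ(A) − μ(B), so adjoining C cannot reverse a comparison. A quick sketch with an invented additive measure on five states:

```python
# Invented additive measure on a five-state space (illustrative only).
mu = {1: 0.10, 2: 0.30, 3: 0.20, 4: 0.25, 5: 0.15}

def m(A):
    # Additivity over states: m(A) = sum of the point masses in A.
    return sum(mu[w] for w in A)

A, B, C = {1, 2}, {3, 4}, {5}   # C is disjoint from both A and B
# Cancellation: the comparison is unchanged after adjoining C.
assert (m(A) >= m(B)) == (m(A | C) >= m(B | C))
# Additivity on disjoint unions.
assert abs(m(A | C) - (m(A) + m(C))) < 1e-12
```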
To prove the proposition, a counter-example is constructed where Ω = {1, 2, 3, 4}; for every I ∈ 𝕁, ≽_I satisfies A1*–A4* and A6, and there does not exist a matrix with rows that are additive (on Σ′) representing {≽_I}_{I∈𝕁}. The statement (and proof) of Theorem 7 does not restrict Ω to be finite. Yet the restricted diversity axiom may do so. For instance, it is easy to see that, if the measures μ_c are non-negative, A4* can be satisfied only if Ω is, indeed, finite. However, one may have versions of the theorem that allow an infinite Ω (say, with an infinite set of memories, as in Theorem 13 below). Theorem 7 only guarantees representation by signed measures. Indeed, since we only use comparisons of pairwise non-included events, the data {≽_I}_{I∈𝕁} do not imply that likelihood rankings are monotone with respect to set inclusion. One may require that, for each I ∈ 𝕁, ≽_I be a qualitative probability in the sense of de Finetti (1937). Specifically, we require that the following axiom be satisfied.

A1′ Qualitative Probability: For every I ∈ 𝕁, ≽_I satisfies: (i) ≽_I is complete and transitive on Σ; (ii) for every three events A, B, C such that (A ∪ B) ∩ C = ∅, we have A ≽_I B ⟺ A ∪ C ≽_I B ∪ C; (iii) for every event A, A ≽_I ∅; (iv) if I ≠ 0 then Ω ≻_I ∅.

Observe that this condition would strengthen both A1 and A6. One may conjecture that imposing this condition would yield a representation such as in (∗∗) of Theorem 6 for all pairs of events. The following proposition states that this conjecture is false.

Proposition 9  Assume that {≽_I}_{I∈𝕁} satisfy A1′, A2*, A3*, A4*. It is possible that the signed measures {μ_c}_{c∈M} obtained in Theorem 7 fail to be non-negative.
6  Frequentism
The framework of Section 5 does not assume any formal relationship between past cases and states of the world. Indeed, one of the strengths of the approach outlined above is that any such relationships may be inferred from plausibility rankings given various memories, rather than assumed a priori. Still, an interesting special case, which is also an important test case, is the situation where memory consists only of past occurrences of the same states that are now possible. For instance, one may be asked to rank the possible outcomes of a roll of a die based on empirical frequencies of these outcomes in past rolls of the same die. It would be reassuring to know that our approach is compatible with frequentism, i.e., that the numerical rankings derived in Section 3 may boil down to relative empirical frequencies in this case. Assume, then, that M = Ω = {1, …, n}, where I ∈ 𝕁 is interpreted as the empirical frequencies of the possible outcomes. Assume that {≽_I}_{I∈𝕁} are qualitative probability relations. We impose two additional assumptions. The first is a symmetry axiom, stating that the names of the outcomes are immaterial. The second is a specificity axiom, requiring that an outcome that has never been observed does not increase the plausibility of events containing it. For the symmetry axiom, we introduce the following notation: let π : {1, …, n} → {1, …, n} be a permutation. For I ∈ 𝕁, define I∘π ∈ 𝕁 by (I∘π)(c) = I(π(c)), and for A ∈ Σ let A_π ∈ Σ be defined by A_π = {π⁻¹(i) | i ∈ A}. Using this notation,
A7 Symmetry: For every permutation π, every I ∈ 𝕁, and every A, B ∈ Σ, A ≽_{I∘π} B iff A_π ≽_I B_π.

Next we introduce

A8 Specificity: Assume that, for j ∈ Ω and I ∈ 𝕁, I(j) = 0. For every A, B ∈ Σ such that B ≽_I A, we also have B ≽_I A ∪ {j}.

We can now state the frequentism result:
Theorem 10  Assume that n ≥ 5 and that {≽_I}_{I∈𝕁} satisfy A1′, A2*, A3*, A4*, A7, and A8. Let {μ_i} be the measures provided by Theorem 7. Then (μ_i({j}))_{1≤i,j≤n} is the identity matrix, up to multiplication by a positive number.

In the case n < 5, Theorem 7 does not provide a representation of {≽_I}_I by measures μ_i. Yet we can simply assume a representation along the lines of Theorem 7 and obtain a similar result:

Proposition 11  Let {≽_I}_{I∈𝕁} be a collection of binary relations on Σ = 2^Ω, and let {μ_i} be non-negative measures on Ω such that, for every I ∈ 𝕁 and every A, B ∈ Σ,

A ≽_I B   iff   Σ_{c∈M} I(c)μ_c(A) ≥ Σ_{c∈M} I(c)μ_c(B).

Assume that {≽_I}_{I∈𝕁} satisfy A1′, A7, and A8. Then (μ_i({j}))_{1≤i,j≤n} is the identity matrix, up to multiplication by a positive number.

Observe that A7 and A8 are obviously necessary for the last two results.
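When (μ_i({j})) is the identity matrix, the representation of Theorem 7 collapses to counting: Σ_c I(c)μ_c(A) is simply the number of past observations that fall in A, so events are ranked by empirical frequency. A sketch with hypothetical die-roll counts:

```python
# Hypothetical empirical frequencies of die outcomes 1..6.
I = {1: 3, 2: 0, 3: 1, 4: 7, 5: 0, 6: 2}

def plausibility(A):
    # With mu_c the point mass at c, sum_c I(c) * mu_c(A) counts the
    # past occurrences that landed in the event A.
    return sum(n for c, n in I.items() if c in A)

assert plausibility({4}) > plausibility({1, 6})   # 7 > 3 + 2
assert plausibility({1, 3, 5}) == 4               # odd outcomes so far
```

Note the role of A8 here: outcomes 2 and 5 were never observed, and adding them to an event indeed leaves its plausibility unchanged.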
7  Derivation of Utility Under Risk

7.1  The result for finite spaces
The previous sections addressed the question of the formation of probabilities, as general-purpose numerical representations of likelihood judgments. In this section we deal with the way these probabilities might be used in decision making under uncertainty. A natural benchmark is the paradigm of expected utility maximization, namely the suggestion that probabilities be used for the computation of the expected value of a desirability function, and that decisions be made so as to maximize this expected value. Yet one may ask, how natural is expected utility maximization? And which utility function should be maximized? One possible answer is the classical expected utility theorem of von Neumann and Morgenstern (1944), who derived a utility function and
the expected utility formula, assuming a preference relation between lotteries with given probabilities. This section presents an answer that is conceptually similar to vNM's, though closer in spirit to the analysis above. In the next section we proceed to relate it to the more general problem of decision making in the face of uncertainty, that is, when probabilities are not explicitly given. Assume that a decision maker is facing a decision problem with a non-empty set of acts A and a finite, non-empty set of states of the world Ω. (Observe that here Ω plays the role of M in Sections 2–6, whereas the set of acts A is the counterpart of Ω in Sections 5–6.) Such problems are often represented by a "decision matrix", or a "game against nature", attaching an outcome to each act-state pair (a, ω). We do not assume any knowledge about this set of outcomes or about the structure of the matrix, and hence suppress it completely. (Equivalently, one introduces a formal set of abstract outcomes that is simply the set of pairs A × Ω.) Let Δ = Δ(Ω) be the set of probability distributions on Ω. We assume that, for every probability vector p ∈ Δ, the decision maker has a binary preference relation ≽_p over A. We now formulate axioms on {≽_p}_{p∈Δ} that correspond to those of Section 2:

A1** Order: For every p ∈ Δ, ≽_p is complete and transitive on A.
A2** Combination: For every p, q ∈ Δ and every a, b ∈ A, if a ≽_p b (a ≻_p b) and a ≽_q b, then a ≽_{αp+(1−α)q} b (a ≻_{αp+(1−α)q} b) for every α ∈ (0, 1).

A3** Archimedean Axiom: For every a, b ∈ A and p ∈ Δ, if a ≻_p b, then for every q ∈ Δ there exists α ∈ (0, 1) such that a ≻_{αp+(1−α)q} b.

A4** Diversity: For every list (a, b, c, d) of distinct elements of A there exists p ∈ Δ such that a ≻_p b ≻_p c ≻_p d. If |A| < 4, then for any strict ordering of the elements of A there exists p ∈ Δ such that ≻_p is that ordering.

The reader will not be surprised to find the following result.

Theorem 12  The following two statements are equivalent if |A| ≥ 4:
(i) {≽_p}_{p∈Δ} satisfy A1**–A4**;
(ii) There is a matrix u : A × Ω → ℝ such that:

(∗∗)  for every p ∈ Δ and every a, b ∈ A, a ≽_p b iff Σ_{ω∈Ω} p(ω)u(a, ω) ≥ Σ_{ω∈Ω} p(ω)u(b, ω),
and for every list (a, b, c, d) of distinct elements of A, the convex hull of differences of the row-vectors (u(a, ·) − u(b, ·)), (u(b, ·) − u(c, ·)), and (u(c, ·) − u(d, ·)) does not intersect ℝ^Ω_−.

Furthermore, in this case the matrix u is unique in the following sense: u and w both satisfy (∗∗) iff there are a scalar λ > 0 and a matrix v : A × Ω → ℝ with identical rows (i.e., with constant columns) such that w = λu + v. Finally, in the case |A| < 4, the numerical representation result (as in (∗∗)) holds, and uniqueness as above is guaranteed.

The proof of this theorem is a straightforward adaptation of that of Theorem 1.
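As an illustration (with invented utility numbers), the representation ranks acts by Σ_ω p(ω)u(a, ω), and the uniqueness claim can be checked directly: rescaling u by λ > 0 and shifting each column by a constant never changes which act is preferred at any p.

```python
# Hypothetical utility matrix u(act, state) over two states.
u = {"safe": (1.0, 1.0), "risky": (3.0, -1.0)}

def best(util, p):
    # Expected utility of act a under probability vector p.
    eu = lambda a: sum(pi * ui for pi, ui in zip(p, util[a]))
    return max(util, key=eu)

assert best(u, (0.7, 0.3)) == "risky"   # 0.7*3 - 0.3*1 = 1.8 > 1.0
assert best(u, (0.3, 0.7)) == "safe"    # 0.3*3 - 0.7*1 = 0.2 < 1.0

# Uniqueness: w = lambda*u + column shifts leaves best responses intact.
lam, shift = 2.0, (10.0, -4.0)          # arbitrary scale and column shifts
w = {a: tuple(lam * ui + s for ui, s in zip(row, shift))
     for a, row in u.items()}
for p in [(0.7, 0.3), (0.3, 0.7), (0.6, 0.4)]:
    assert best(w, p) == best(u, p)
```

The column shift adds Σ_ω p(ω)·shift(ω) to every act's expected utility, an act-independent constant, which is exactly why best responses survive.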
7.2  Discussion
Theorem 12 states that, given a set of acts A and a set of states Ω, one may find a matrix of utility numbers, U = (u(a, ω))_{a∈A, ω∈Ω}, such that, given any probability vector on Ω, the decision maker ranks the acts according to their expected utility. Thus, in a manner analogous to the result of von Neumann and Morgenstern (vNM), we derive a utility function, together with expected utility maximization, from choices between acts whose distributions are known. Moreover, as in vNM's result, we assume a weak order that is continuous in probabilities and that respects some form of convexity in the space of distributions, from which follows a representation that is linear in probabilities. Yet there are several differences between vNM's result and Theorem 12. vNM assume that one may compare all lotteries defined over a
given set of outcomes. Thus, one may think of the data for vNM's theorem as comparisons of all possible distributions over the entries in the matrix. By contrast, we assume that the decision maker can compare only lotteries derived from two rows in the matrix for the same probability vector over the columns. In particular, the data assumed in Theorem 12 will not include a comparison of a certain outcome (i.e., a degenerate lottery) to a non-degenerate lottery. As a result, we need an additional axiom here to guarantee the expected utility representation. We opt for a diversity axiom. This axiom gives us, in addition to the representing matrix, an independence or separation condition. The latter implies, among other things, that there are no (weakly) dominated rows in the representing matrix. Finding an exact characterization of the numerical representation (∗∗) (without the separation condition) is a possible direction for future research. Observe that the decision matrix may be viewed as the game form of a two-person game. In this case a probability vector p ∈ Δ is interpreted as a mixed strategy of the column player. For each such strategy, the row player has a preference ranking over her pure strategies. Should these preference orders satisfy our axioms, we obtain the expected utility maximization result. The uniqueness result states that the utility function can be shifted by an additive constant in any given column, resulting in the same preference ordering for each p ∈ Δ. Indeed, such a shift, as well as multiplication by a positive constant, leaves the best-response correspondence intact. This is the standard uniqueness assumption on payoff functions in noncooperative games. Another possible direction for future research is to obtain a similar result in the context of n-person games, where each player has preferences over her pure strategies given distributions over the pure strategies of the other n − 1 players that are stochastically independent.
(Compare Fishburn (1976) and Fishburn and Roberts (1978).) Finally, we note that Theorem 12 can be easily extended to infinite state
spaces, since its proof relies on separation theorems. For concreteness, we formulate the generalized result.
7.3  The general result
Assume that a decision maker is facing a decision problem with a non-empty set of acts A and a measurable space of states of the world (Ω, Σ), where Σ is a σ-algebra of subsets of Ω. Further, assume that Σ includes all singletons. Let B(Ω, Σ) be the space of bounded Σ-measurable real-valued functions on Ω. Recall that ba(Ω, Σ), the space of finitely additive bounded measures on Σ, is the dual of B(Ω, Σ). Let P denote the subset of ba(Ω, Σ) consisting of finitely additive probability measures on Σ. Assume that, for every probability measure p ∈ P, the decision maker has a binary preference relation ≽_p over A. We now formulate axioms on {≽_p}_{p∈P} that correspond to those of Section 2:

A1** Order: For every p ∈ P, ≽_p is complete and transitive on A.
A2** Combination: For every p, q ∈ P and every a, b ∈ A, if a ≽_p b (a ≻_p b) and a ≽_q b, then a ≽_{αp+(1−α)q} b (a ≻_{αp+(1−α)q} b) for every α ∈ (0, 1).

A3*** Continuity: For every a, b ∈ A, the set {p ∈ P | a ≻_p b} is open in the relative weak* topology.
A4** Diversity: For every list (a, b, c, d) of distinct elements of A there exists p ∈ P such that a ≻_p b ≻_p c ≻_p d. If |A| < 4, then for any strict ordering of the elements of A there exists p ∈ P such that ≻_p is that ordering.

We can now state:

Theorem 13  The following two statements are equivalent if |A| ≥ 4:
(i) {≽_p}_{p∈P} satisfy A1**, A2**, A3***, A4**;
(ii) For every a ∈ A there exists u(a, ·) ∈ B(Ω, Σ) such that:
(∗∗)  for every p ∈ P and every a, b ∈ A, a ≽_p b iff ∫_Ω u(a, ·) dp ≥ ∫_Ω u(b, ·) dp,
and for every list (a, b, c, d) of distinct elements of A, the convex hull of the functions (u(a, ·) − u(b, ·)), (u(b, ·) − u(c, ·)), and (u(c, ·) − u(d, ·)) does not intersect B(Ω, Σ)_− (i.e., the non-positive functions in B(Ω, Σ)).

Furthermore, in this case the functions {u(a, ·)}_{a∈A} are unique in the following sense: {u(a, ·)}_{a∈A} and {w(a, ·)}_{a∈A} both satisfy (∗∗) iff there are a scalar λ > 0 and a function v ∈ B(Ω, Σ) such that w(a, ·) = λu(a, ·) + v for all a ∈ A. Finally, in the case |A| < 4, the numerical representation result (as in (∗∗)) holds, and uniqueness as above is guaranteed.
8  Decision Theories

8.1  The cognitive foundations of subjective expected utility maximization
The classical axiomatic derivations of Bayesian probabilities by Ramsey (1931), de Finetti (1937), and Savage (1954) derive probabilities from observed choices. Their behavioral approach does not purport to describe an actual mental process of belief formation. Rather, it suggests that the meaning of probability judgments be found in the way they are used in decision making. Further, this approach does not require that the decision maker be aware of probabilities, their meaning, their sources, or their uses. Finally, even if the decision maker happens to be a cognizant entity who can reason in terms of probabilities, the behavioral axiomatic approach does not attempt to follow the causal link from probabilities to behavior, but to reason backwards from observed behavior to the probability judgments that could have rationalized it.
Our approach is different. Since we tend to have greater confidence in behavioral models that are also cognitively plausible, our goal is to axiomatically study the cognitive process by which probabilities are formed, and then the process by which they are used in decisions, to the extent that they indeed are. Thus, we adopt a cognitive axiomatic approach that, as opposed to the behavioral one, attempts to follow the causal link from cognition to decision, in those cases where such a conscious link that involves probabilities does exist. According to our story, conscious expected utility maximization starts with developing a state space, presumably encompassing all possible eventualities; proceeds with the formation of probabilistic beliefs over this state space; and concludes with the computation of expected utility with respect to this probability. Our theory does not offer an account of the formation of the state space. States and events are assumed primitive in this paper. We will shortly see that our theory may suggest an implicit way to ascribe states of the world to a decision maker, but it does not say how these are generated. In line with the case-based approach, one may argue that the states of the world that an individual imagines are either cases that she or others have experienced in the past, or some variations on these known themes. For instance, the derivation of frequentism in Section 6 assumes that the set of possible states is exactly the set of cases observed. More generally, one may perform certain operations on past cases to generate new possible states from them. At this point, however, our theory says nothing about the grammar of these operations. Given a state space, the theory developed in Sections 5-6 attempts to shed some light on the way in which probabilities over states are formed.
The axiomatic approach derives the theoretical concept of probability from presumably observable "at least as likely as" relations, but the formula we obtain does not treat the formation of beliefs as a "black box". Rather, it suggests that probabilities are, or should be, obtained from summations of
similarity values, multiplied by the number of repetitions. Specifically, Theorem 7 suggests that there exists a matrix $V$, whose rows denote states of the world and whose columns denote past cases, such that, given a vector of nonnegative integers $I \in J$ specifying the number of occurrences of each case, states of the world are assigned (not necessarily normalized) probabilities $p$ defined by

$$p = VI.$$

That is, for every state $i$, one computes $\sum_{c \in M} I(c)v(i,c)$ and treats it as the probability of this state.

The last stage of expected utility maximization consists of the computation of expected utility given the probabilities obtained above. Here again our theory does not offer an account of the mental process of utility evaluation. Like classical utility theory, expected utility theory, and related paradigms, we derive utility indices from observations, but say very little about the way an individual actually thinks about this problem. However, Theorems 12-13 do support the expected utility paradigm, and offer a way to define, or even elicit, a utility matrix $U$ such that, for every probability vector $p$, any two acts $a, b$ are ranked according to a function $f$ where

$$f = Up.$$

That is, for every act $a$ one computes $\sum_{i \in \Omega} p(i)u(a,i)$ and chooses an act so as to maximize this index.

Assume that there are $n$ acts, $m$ cases, and $s$ states. In this case, the matrix $V$ is of dimension $s \times m$, the vector $I$ of dimension $m$, and the probability vector $p = VI$ of dimension $s$. Similarly, the dimension of the matrix $U$ is $n \times s$, and the evaluation vector $f$ is of dimension $n$. In particular, it makes sense to multiply $U$ by $V$ and to get a new matrix $S \equiv UV$ of dimension $n \times m$. We then have

$$f = Up = UVI = SI.$$

This algebra only says that, if probabilities $p$ are linear in repetitions $I$,
and evaluations f are linear in probabilities p, then evaluations f are also linear in repetitions I. As we argue below, however, the maximization of a numerical function of the form f = SI is more general than expected utility theory.
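The dimension bookkeeping above can be checked directly; the sketch below (with invented matrices, $s=2$ states, $m=3$ cases, $n=3$ acts) verifies that evaluating acts via $p = VI$ and then $f = Up$ agrees with the one-step computation $f = (UV)I$:

```python
def matmul(A, B):
    # Naive matrix product of lists-of-rows.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

V = [[1, 0, 2],          # s=2 states x m=3 cases: case-to-state weights
     [0, 3, 1]]
U = [[2, 1],             # n=3 acts x s=2 states: utilities
     [0, 4],
     [1, 1]]
I = [[2], [1], [0]]      # m=3 case repetitions, as a column vector

p = matmul(V, I)             # unnormalized "probabilities" of the states
f_two_step = matmul(U, p)    # expected-utility evaluation of the acts
S = matmul(U, V)             # the n x m matrix lumping both stages together
f_one_step = matmul(S, I)

assert f_two_step == f_one_step == [[7], [12], [5]]
```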
8.2
Generalized case-based decision theory
Theorem 1 was formulated for a predictor who ranks eventualities based on past cases. As mentioned above, it can also be interpreted as representing preferences of a decision maker between acts. Specifically, replace the set of objects $X$ by the set of acts $A$, and interpret $\succsim_I$ as a preference order over acts, given a case-repetition profile $I$. In this case there are additional classes of examples in which the combination axiom may fail to hold. Also, the diversity axiom may become more problematic. (See the discussion in Gilboa and Schmeidler (1997).) Yet, there are many situations in which the axioms are plausible. In those situations it follows that decisions are made according to the following rule: for each act $a$ and each case $c$ there is a number $s(a,c)$ such that acts are ranked according to $f(a) = \sum_{c \in M} I(c)s(a,c)$. That is, there is an $n \times m$ matrix $S$ such that $f = SI$ ranks acts according to their desirability. This generalized case-based decision theory is purely behaviorist: it attempts to describe the structure of a stimulus-response function, where the stimuli are past cases and the response is the observed choice, without delving into any mental process that implements this function. As opposed to utility theory or expected utility theory, which are also based on observed choice, there is no attempt here to measure or define intentional constructs such as "probability" or "utility". As such, this behaviorist decision theory can serve as a shell within which various specific theories about cognitive processes may be embedded. For instance, starting with a representation of preferences by $f = SI$, one may view the equality $S \equiv UV$ above as a decomposition of the matrix $S$:
rather than lumping together the various effects that case $c$ has on the desirability of act $a$ in one number $s(a,c)$, let us spell out the implicit reasoning by which the past case makes the act more or less desirable. If we decompose the matrix $S$ into a product $UV$, we can attribute to the decision maker a reasoning process that involves states of the world. The matrix $V$ specifies beliefs: to what extent each case $c$ renders each state $s$ plausible. The matrix $U$, in turn, represents tastes: to what extent the decision maker would like to choose act $a$ in state $s$. Hence, if we only have access to choice behavior data (the matrix $S$), we may attribute to the decision maker a set of states, such that their probabilities are derived from cases as in Section 5, and such that they are used in decisions according to the expected utility paradigm as in Section 7. Naturally, such a decomposition $S = UV$ is always possible, if only by identifying states with cases and setting $V$ to be the identity matrix. But one would hope that the decomposition may be more meaningful and intuitive than that and, furthermore, that in many decisions the number of states may be smaller than the number of past cases. This interpretation can therefore entail a behavioral definition of states of the world: given observable behavior $S$, imagined states are ascribed to the decision maker. Our primary interpretation, however, is cognitive: expected utility theory makes sense precisely when states of the world are palpable cognitive objects, that is, when people can imagine them, think about them, compare them, and so forth. In these situations, it stands to reason that states be ranked according to probabilities, as suggested by Theorem 7. Should a decision maker also satisfy the assumptions of Theorem 12, expected utility maximization will result.
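The trivial decomposition mentioned above (states identified with cases, $V$ the identity matrix) can be written out explicitly; the behavioral matrix $S$ below is invented for illustration:

```python
def matmul(A, B):
    # Naive matrix product of lists-of-rows.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

S = [[2, 3, 5],                      # n=2 acts x m=3 cases (hypothetical data)
     [0, 12, 4]]
m = len(S[0])
# Identify states with cases: V is the m x m identity matrix.
V = [[1 if i == j else 0 for j in range(m)] for i in range(m)]
U = S                                # the "tastes" matrix then coincides with S

assert matmul(U, V) == S             # S = UV holds trivially
```

A more informative decomposition would use fewer states than cases, with $V$ genuinely aggregating the evidence; the identity choice shows only that existence is never the issue.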
The behaviorist theory according to which the decision maker maximizes a function $f = SI$ ($I \in J$) is also a generalization of case-based decision theory (Gilboa and Schmeidler (1995, 1997, 1999) and Gilboa, Schmeidler, and Wakker (1999)). In this specification, a past case is a triple $c = (q, b, r)$: a problem $q$, in which an act $b$ was chosen and which resulted in an outcome
$r$. The decision maker assesses the similarity of choosing $b$ in problem $q$ to choosing an act $a$ in the present problem $p$, and the support that the case $c$ lends to the act $a$ is the product of this similarity, $s((p,a),(q,b))$, and the utility of the outcome experienced in the past, $u(r)$. The more general formulation $f = SI$ allows the degree of support $s(a,c)$ not to be a product of similarity and utility. Moreover, even if $s(a,c)$ does take such a product form, the utility value may differ from the utility of the outcome experienced in the past. Finally, it is also possible that the past case $c$ does not involve any decision at all. For instance, the case $c$ may be the fact that it rained on a particular day. Such a case does not tell the story of a decision problem, yet it may well affect the desirability of acts in future decision problems. The behaviorist decision rule $f = SI$ thus has two extreme special cases: expected utility theory and case-based decision theory. We argue that it may be intuitively appealing in many intermediate situations as well. The expected utility decomposition $f = UVI$ is cognitively plausible only in those situations that allow a complete specification of states of the world. Further, when such a specification exists, expected utility theory demands that the decision maker be able to assess the utility of every possible outcome, and that she weigh these utilities according to her subjective probabilities, regardless of the degree of hypothetical reasoning that was required to assess utility values. Case-based decision theory (as in Gilboa and Schmeidler (1995, 1997) and Gilboa, Schmeidler, and Wakker (1999)), by contrast, ignores possible outcomes that have not been reflected in past cases. Unless one introduces hypothetical or counterfactual cases, it does not allow the decision maker any hypothetical reasoning.
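In the product form, the support that a repeated case lends to an act is similarity times past utility, summed with multiplicities. A toy sketch of this evaluation rule (all similarity and utility numbers are invented):

```python
# Each past case is (past problem, past act, outcome), with repetition count I(c).
cases = {('q1', 'b1', 'good'): 2,
         ('q2', 'b2', 'bad'): 1}
util = {'good': 1.0, 'bad': -1.0}                  # u(r): utility of past outcomes
sim = {('a1', 'q1', 'b1'): 0.8, ('a1', 'q2', 'b2'): 0.1,
       ('a2', 'q1', 'b1'): 0.2, ('a2', 'q2', 'b2'): 0.9}

def score(act):
    # f(a) = sum over cases of I(c) * s((p,a),(q,b)) * u(r)
    return sum(reps * sim[(act, q, b)] * util[r]
               for (q, b, r), reps in cases.items())

best = max(['a1', 'a2'], key=score)
assert best == 'a1'    # 2*0.8*1 - 1*0.1 = 1.5 beats 2*0.2*1 - 1*0.9 = -0.5
```

The general rule $f = SI$ drops the factorization: $s(a,c)$ may be any number, not necessarily a similarity-utility product.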
The present generalization of case-based decision theory allows us to describe a decision maker who is capable of hypothetical reasoning, but who is not necessarily committed to having well-defined answers to all hypothetical questions. Further, such a decision maker may assign different weights to outcomes that have actually been experienced as compared to those that are merely imagined.
We therefore find that the present theory provides a rather general behaviorist foundation for the description of decision under uncertainty. Yet, in each application, our confidence in such a theory depends on the plausibility of the cognitive story that is told within its framework.
9
A Voting Theory Interpretation
The result of Section 2 above may also be interpreted as a derivation of scoring rules in voting theory. Smith (1973), Young (1975), and Myerson (1995) provide such axiomatic derivations. These models bear similarity to ours in that they assume a selection of candidates as a function of the number of votes cast for each possible ballot, corresponding in our model to the ranking of eventualities as a function of the number of repetitions of each case. Yet several differences exist. First, Smith and Young assume that the possible votes are the permutations of the set of candidates. Myerson allows an abstract set of votes, but employs a neutrality axiom, which relates the set of candidates to the set of votes. By contrast, our model does not presuppose any relationship between cases and eventualities. This also allows very different interpretations, as in Section 3 above. Second, while Smith assumes that selection is a complete ordering of the candidates, Young and Myerson assume only a choice correspondence. Our model, like Smith's, therefore assumes that more information is given in the data. On the other hand, we derive almost-unique vectors $\{v^a\}_{a}$ and can thus claim to provide a definition of these vectors by in-principle observable qualitative plausibility rankings.
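A scoring rule in the sense of Smith and Young aggregates ballots linearly, exactly as $f = SI$ aggregates case repetitions: each ballot type plays the role of a case, and its count the role of $I(c)$. A Borda-style sketch (candidates, ballots, and counts are invented for the example):

```python
candidates = ['x', 'y', 'z']
# Ballot types are strict rankings; the value is how many such ballots were cast.
ballots = {('x', 'y', 'z'): 4,
           ('y', 'z', 'x'): 3,
           ('z', 'y', 'x'): 2}

def borda(cand):
    # Score 2 for a top place, 1 for middle, 0 for bottom, weighted by counts:
    # a linear function of the ballot-count vector, i.e., a row of f = S I.
    return sum(n * (2 - order.index(cand)) for order, n in ballots.items())

winner = max(candidates, key=borda)
assert winner == 'y'     # y scores 4*1 + 3*2 + 2*1 = 12, beating x (8) and z (7)
```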
Appendix: Proofs

Proof of Theorem 1:
Theorem 1 is reminiscent of the main result in Gilboa and Schmeidler (1997). In that work, cases are assumed to involve numerical payoffs, and algebraic and topological axioms are formulated in the payoff space. Here,
by contrast, cases are not assumed to have any structure, and the algebraic and topological structures are given by the number of repetitions. This fact introduces two main difficulties. First, the space of "contexts" for which preferences are defined is not a Euclidean space, but only the integer points thereof. This requires some care in the application of separation theorems. Second, repetitions can only be non-negative. This fact introduces several complications and, in particular, changes the algebraic implication of the diversity condition.

We present the proof for the case $|X| \ge 4$. The proofs for the cases $|X| = 2$ and $|X| = 3$ will be described as by-products along the way.

We start by proving that (i) implies (ii). We first note that the following homogeneity property holds:

Claim 1. For every $I \in \mathbb{Z}^M_+$ and every $k \in \mathbb{N}$, $\succsim_I = \succsim_{kI}$.

Proof: Follows from consecutive application of the combination axiom. $\square$

In view of this claim, we extend the definition of $\succsim_I$ to functions $I$ whose values are non-negative rationals. Given $I \in \mathbb{Q}^M_+$, let $k \in \mathbb{N}$ be such that $kI \in \mathbb{Z}^M_+$ and define $\succsim_I = \succsim_{kI}$. The relation $\succsim_I$ is well-defined in view of Claim 1. By definition and Claim 1 we also have:

Claim 2 (Homogeneity). For every $I \in \mathbb{Q}^M_+$ and every $q \in \mathbb{Q}$, $q > 0$: $\succsim_{qI} = \succsim_I$.

Claim 2, A1, and A2 imply:

Claim 3. (The order axiom) For every $I \in \mathbb{Q}^M_+$, $\succsim_I$ is complete and transitive on $X$; and (the combination axiom) for every $I, J \in \mathbb{Q}^M_+$, every $a, b \in X$, and every $p, q \in \mathbb{Q}$, $p, q > 0$: if $a \succsim_I b$ ($a \succ_I b$) and $a \succsim_J b$, then $a \succsim_{pI+qJ} b$ ($a \succ_{pI+qJ} b$).

Two special cases of the combination axiom are of interest: (i) $p = q = 1$; and (ii) $p + q = 1$. Claims 2 and 3, and the Archimedean axiom, A3, imply the following version of the axiom for the $\mathbb{Q}^M_+$ case:
Claim 4 (The Archimedean axiom). For every $I, J \in \mathbb{Q}^M_+$ and every $a, b \in X$, if $a \succ_I b$, then there exists $r \in [0,1) \cap \mathbb{Q}$ such that $a \succ_{rI+(1-r)J} b$.

It is easy to conclude from Claim 4 (and Claim 3) that for every $I, J \in \mathbb{Q}^M_+$ and every $a, b \in X$, if $a \succ_I b$, then there exists $r \in [0,1) \cap \mathbb{Q}$ such that $a \succ_{pI+(1-p)J} b$ for every $p \in (r,1) \cap \mathbb{Q}$.

The following notation will be convenient for stating the first lemma. For every $a, b \in X$ let

$$Y^{ab} \equiv \{I \in \mathbb{Q}^M_+ \mid a \succ_I b\} \quad\text{and}\quad W^{ab} \equiv \{I \in \mathbb{Q}^M_+ \mid a \succsim_I b\}.$$

Observe that by definition and A1: $Y^{ab} \subset W^{ab}$; $W^{ab} \cap Y^{ba} = \emptyset$; and $W^{ab} \cup Y^{ba} = \mathbb{Q}^M_+$.

The first main step in the proof of the theorem is:

Lemma 1. For every distinct $a, b \in X$ there is a vector $v^{ab} \in \mathbb{R}^M$ such that:
(i) $W^{ab} = \{I \in \mathbb{Q}^M_+ \mid v^{ab} \cdot I \ge 0\}$;
(ii) $Y^{ab} = \{I \in \mathbb{Q}^M_+ \mid v^{ab} \cdot I > 0\}$;
(iii) $W^{ba} = \{I \in \mathbb{Q}^M_+ \mid v^{ab} \cdot I \le 0\}$;
(iv) $Y^{ba} = \{I \in \mathbb{Q}^M_+ \mid v^{ab} \cdot I < 0\}$;
(v) neither $v^{ab} \le 0$ nor $v^{ab} \ge 0$;
(vi) $-v^{ab} = v^{ba}$.
Moreover, the vector $v^{ab}$ satisfying (i)-(iv) is unique up to multiplication by a positive number.
The lemma states that we can associate with every pair of distinct eventualities $a, b \in X$ a separating hyperplane defined by $v^{ab} \cdot x = 0$ ($x \in \mathbb{R}^M$), such that $a \succsim_I b$ iff $I$ is on a given side of the plane (i.e., iff $v^{ab} \cdot I \ge 0$). Observe that if there are only two alternatives, Lemma 1 completes the proof of sufficiency: for instance, one may set $v^a = v^{ab}$ and $v^b = 0$. It then follows that $a \succsim_I b$ iff $v^{ab} \cdot I \ge 0$, i.e., iff $v^a \cdot I \ge v^b \cdot I$. More generally, we will show in the following lemmata that one can find a vector $v^a$ for every alternative $a$, such that, for every $a, b \in X$, $v^{ab}$ is a positive multiple of $(v^a - v^b)$.
Before starting the proof we introduce additional notation: let $\widehat{W}^{ab}$ and $\widehat{Y}^{ab}$ denote the convex hulls (in $\mathbb{R}^M$) of $W^{ab}$ and $Y^{ab}$, respectively. For a subset $B$ of $\mathbb{R}^M$, let $\mathrm{int}(B)$ denote the set of interior points of $B$.

Proof of Lemma 1: We break the proof into several claims.

Claim 5. For every distinct $a, b \in X$, $Y^{ab} \cap \mathrm{int}(\widehat{Y}^{ab}) \neq \emptyset$.
Proof: By the diversity axiom, $Y^{ab} \neq \emptyset$ for all $a, b \in X$, $a \neq b$. Let $I \in Y^{ab} \cap \mathbb{Z}^M_+$ and let $J \in \mathbb{Z}^M_+$ with $J(m) > 1$ for all $m \in M$. By the Archimedean axiom there is an $l \in \mathbb{N}$ such that $K = lI + J \in Y^{ab}$. Let $(x_j)_{j=1}^{2^{|M|}}$ be the $2^{|M|}$ distinct vectors in $\mathbb{R}^M$ with coordinates $1$ and $-1$. For $j = 1, \ldots, 2^{|M|}$, define $y_j = K + x_j$. Obviously, $y_j \in \mathbb{Q}^M_+$ for all $j$. By Claim 4 there is an $r_j \in [0,1) \cap \mathbb{Q}$ such that $z_j = r_j K + (1 - r_j)y_j \in Y^{ab}$ (for all $j$). Clearly, the convex hull of $\{z_j \mid j = 1, \ldots, 2^{|M|}\}$, which is included in $\widehat{Y}^{ab}$, contains an open neighborhood of $K$. $\square$

Claim 6. For every distinct $a, b \in X$, $\widehat{W}^{ba} \cap \mathrm{int}(\widehat{Y}^{ab}) = \emptyset$.
Proof: Suppose, by way of negation, that for some $x \in \mathrm{int}(\widehat{Y}^{ab})$ there are $(y_i)_{i=1}^k$ and $(\lambda_i)_{i=1}^k$, $k \in \mathbb{N}$, such that for all $i$: $y_i \in W^{ba}$, $\lambda_i \in [0,1]$, $\sum_{i=1}^k \lambda_i = 1$, and $x = \sum_{i=1}^k \lambda_i y_i$. Since $x \in \mathrm{int}(\widehat{Y}^{ab})$, there is a ball of radius $\varepsilon > 0$ around $x$ included in $\widehat{Y}^{ab}$. Let $\delta = \varepsilon/(2\sum_{i=1}^k \|y_i\|)$, and for each $i$ let $q_i \in \mathbb{Q} \cap [0,1]$ be such that $|q_i - \lambda_i| < \delta$ and $\sum_{i=1}^k q_i = 1$. Hence $y = \sum_{i=1}^k q_i y_i \in \mathbb{Q}^M_+$ and $\|y - x\| < \varepsilon$, which, in turn, implies $y \in \widehat{Y}^{ab} \cap \mathbb{Q}^M_+$. Since $y_i \in W^{ba}$ for all $i$, consecutive application of the combination axiom (Claim 3) yields $y = \sum_{i=1}^k q_i y_i \in W^{ba}$. On the other hand, $y$ is a convex combination of points in $Y^{ab} \subset \mathbb{Q}^M_+$ and thus it has a representation with rational coefficients (because the rationals are an algebraic field). Applying Claim 3 consecutively as above, we conclude that $y \in Y^{ab}$, a contradiction. $\square$

The main step in the proof of Lemma 1: The last two claims imply that (for all $a, b \in X$, $a \neq b$) $\widehat{W}^{ab}$ and $\widehat{Y}^{ba}$ satisfy the conditions of a separating hyperplane theorem. So there is a vector $v^{ab} \neq 0$ and a number $c$ such that

$$v^{ab} \cdot I \ge c \ \text{ for every } I \in \widehat{W}^{ab}, \qquad v^{ab} \cdot I \le c \ \text{ for every } I \in \widehat{Y}^{ba}.$$

Moreover,

$$v^{ab} \cdot I > c \ \text{ for every } I \in \mathrm{int}(\widehat{W}^{ab}), \qquad v^{ab} \cdot I < c \ \text{ for every } I \in \mathrm{int}(\widehat{Y}^{ba}).$$
By homogeneity (Claim 2), $c = 0$. Parts (i)-(iv) of the lemma are restated as a claim and proved below.

Claim 7. For all $a, b \in X$, $a \neq b$: $W^{ab} = \{I \in \mathbb{Q}^M_+ \mid v^{ab} \cdot I \ge 0\}$; $Y^{ab} = \{I \in \mathbb{Q}^M_+ \mid v^{ab} \cdot I > 0\}$; $W^{ba} = \{I \in \mathbb{Q}^M_+ \mid v^{ab} \cdot I \le 0\}$; and $Y^{ba} = \{I \in \mathbb{Q}^M_+ \mid v^{ab} \cdot I < 0\}$.

Proof: (a) $W^{ab} \subset \{I \in \mathbb{Q}^M_+ \mid v^{ab} \cdot I \ge 0\}$ follows from the separation result and the fact that $c = 0$.

(b) $Y^{ab} \subset \{I \in \mathbb{Q}^M_+ \mid v^{ab} \cdot I > 0\}$: assume that $a \succ_I b$ and, by way of negation, $v^{ab} \cdot I \le 0$. Choose $J \in Y^{ba} \cap \mathrm{int}(\widehat{Y}^{ba})$. Such a $J$ exists by Claim 5. Since $c = 0$, $J$ satisfies $v^{ab} \cdot J < 0$. By Claim 4 there exists $r \in [0,1)$ such that $rI + (1-r)J \in Y^{ab} \subset W^{ab}$. By (a), $v^{ab} \cdot (rI + (1-r)J) \ge 0$. But $v^{ab} \cdot I \le 0$ and $v^{ab} \cdot J < 0$, a contradiction. Therefore $Y^{ab} \subset \{I \in \mathbb{Q}^M_+ \mid v^{ab} \cdot I > 0\}$.

(c) $Y^{ba} \subset \{I \in \mathbb{Q}^M_+ \mid v^{ab} \cdot I < 0\}$: assume that $b \succ_I a$ and, by way of negation, $v^{ab} \cdot I \ge 0$. By Claim 5 there is a $J \in Y^{ab}$ with $J \in \mathrm{int}(\widehat{Y}^{ab}) \subset \mathrm{int}(\widehat{W}^{ab})$. The inclusion $J \in \mathrm{int}(\widehat{W}^{ab})$ implies $v^{ab} \cdot J > 0$. Using the Archimedean axiom, there is an $r \in [0,1)$ such that $rI + (1-r)J \in Y^{ba}$. The separation theorem implies that $v^{ab} \cdot (rI + (1-r)J) \le 0$, which is impossible if $v^{ab} \cdot I \ge 0$ and $v^{ab} \cdot J > 0$. This contradiction proves that $Y^{ba} \subset \{I \in \mathbb{Q}^M_+ \mid v^{ab} \cdot I < 0\}$.

(d) $W^{ba} \subset \{I \in \mathbb{Q}^M_+ \mid v^{ab} \cdot I \le 0\}$: assume that $b \succsim_I a$ and, by way of negation, $v^{ab} \cdot I > 0$. Let $J$ satisfy $b \succ_J a$. By (c), $v^{ab} \cdot J < 0$. Define $r = (v^{ab} \cdot I)/(-v^{ab} \cdot J) > 0$. By homogeneity (Claim 2), $b \succ_{rJ} a$. By Claim 3,
$I + rJ \in Y^{ba}$. Hence, by (c), $v^{ab} \cdot (I + rJ) < 0$. However, direct computation yields $v^{ab} \cdot (I + rJ) = v^{ab} \cdot I + r\,v^{ab} \cdot J = 0$, a contradiction. It follows that $W^{ba} \subset \{I \in \mathbb{Q}^M_+ \mid v^{ab} \cdot I \le 0\}$.

(e) $W^{ab} \supset \{I \in \mathbb{Q}^M_+ \mid v^{ab} \cdot I \ge 0\}$: follows from completeness and (c).
(f) $Y^{ab} \supset \{I \in \mathbb{Q}^M_+ \mid v^{ab} \cdot I > 0\}$: follows from completeness and (d).
(g) $Y^{ba} \supset \{I \in \mathbb{Q}^M_+ \mid v^{ab} \cdot I < 0\}$: follows from completeness and (a).
(h) $W^{ba} \supset \{I \in \mathbb{Q}^M_+ \mid v^{ab} \cdot I \le 0\}$: follows from completeness and (b). $\square$

Completion of the proof of the Lemma:
Part (v) of the Lemma, i.e., $v^{ab} \notin \mathbb{R}^M_+ \cup \mathbb{R}^M_-$ for $a \neq b$, follows from the facts that $Y^{ab} \neq \emptyset$ and $Y^{ba} \neq \emptyset$.

Before proving part (vi), we prove uniqueness. Assume that both $v^{ab}$ and $u^{ab}$ satisfy (i)-(iv). In this case, $u^{ab} \cdot x \le 0$ implies $v^{ab} \cdot x \le 0$ for all $x \in \mathbb{R}^M_+$. (Otherwise, there exists $I \in \mathbb{Q}^M_+$ with $u^{ab} \cdot I \le 0$ but $v^{ab} \cdot I > 0$, contradicting the fact that both $v^{ab}$ and $u^{ab}$ satisfy (i)-(iv).) Similarly, $u^{ab} \cdot x \ge 0$ implies $v^{ab} \cdot x \ge 0$. Applying the same argument with the roles of $v^{ab}$ and $u^{ab}$ exchanged, we conclude that $\{x \in \mathbb{R}^M_+ \mid v^{ab} \cdot x = 0\} = \{x \in \mathbb{R}^M_+ \mid u^{ab} \cdot x = 0\}$. Moreover, since $\mathrm{int}(\widehat{Y}^{ab}) \neq \emptyset$ and $\mathrm{int}(\widehat{Y}^{ba}) \neq \emptyset$, it follows that $\{x \in \mathbb{R}^M_+ \mid v^{ab} \cdot x = 0\} \cap \mathrm{int}(\mathbb{R}^M_+) \neq \emptyset$. This implies that $\{x \in \mathbb{R}^M \mid v^{ab} \cdot x = 0\} = \{x \in \mathbb{R}^M \mid u^{ab} \cdot x = 0\}$, i.e., $v^{ab}$ and $u^{ab}$ have the same null set and are therefore multiples of each other. That is, there exists $\alpha$ such that $u^{ab} = \alpha v^{ab}$. Since both satisfy (i)-(iv), $\alpha > 0$.

Finally, we prove part (vi). Observe that both $v^{ab}$ and $-v^{ba}$ satisfy (i)-(iv) (stated for the ordered pair $(a,b)$). By the uniqueness result, $-v^{ba} = \alpha v^{ab}$ for some positive number $\alpha$. At this stage we redefine the vectors $\{v^{ab}\}_{a,b \in X}$ from the separation result as follows: for every unordered pair $\{a,b\} \subset X$, one of the two ordered pairs, say $(b,a)$, is arbitrarily chosen, and then $v^{ab}$ is rescaled so that $v^{ab} = -v^{ba}$. (If $X$ is of uncountable power, the axiom of choice has to be used.) $\square$
Lemma 2. For every three distinct eventualities $f, g, h \in X$ and the corresponding vectors $v^{fg}, v^{gh}, v^{fh}$ from Lemma 1, there are unique $\alpha, \beta > 0$ such that

$$\alpha v^{fg} + \beta v^{gh} = v^{fh}.$$

The key argument in the proof of Lemma 2 is that, if $v^{fh}$ is not a linear combination of $v^{fg}$ and $v^{gh}$, one may find a vector $I$ for which $\succ_I$ is cyclical. If there are only three alternatives $f, g, h \in X$, Lemma 2 allows us to complete the proof as follows: choose an arbitrary vector $v^{fh}$ that separates between $f$ and $h$. Then choose the multiples of $v^{fg}$ and of $v^{gh}$ defined by the lemma. Proceed to define $v^f = v^{fh}$, $v^g = \beta v^{gh}$, and $v^h = 0$. By construction, $(v^f - v^h)$ is (equal and therefore) proportional to $v^{fh}$, hence $f \succsim_I h$ iff $v^f \cdot I \ge v^h \cdot I$. Also, $(v^g - v^h)$ is proportional to $v^{gh}$, and it follows that $g \succsim_I h$ iff $v^g \cdot I \ge v^h \cdot I$. The point is, however, that, by Lemma 2, we obtain the same result for the last pair: $(v^f - v^g) = (v^{fh} - \beta v^{gh}) = \alpha v^{fg}$, and $f \succsim_I g$ iff $v^f \cdot I \ge v^g \cdot I$ follows.

Proof of Lemma 2: First note that for every three distinct eventualities $f, g, h \in X$, if $v^{fg}$ and $v^{gh}$ are collinear, then for all $I$ either [$f \succ_I g$ iff $g \succ_I h$] or [$f \succ_I g$ iff $h \succ_I g$]. Both implications contradict diversity. Therefore any two vectors in $\{v^{fg}, v^{gh}, v^{fh}\}$ are linearly independent. This immediately implies the uniqueness claim of the lemma. Next we introduce:

Claim 8. For every distinct $f, g, h \in X$ and every $\lambda, \mu \in \mathbb{R}$, if $\lambda v^{fg} + \mu v^{gh} \le 0$, then $\lambda = \mu = 0$.

Proof: Observe that Lemma 1(v) implies that if one of the numbers $\lambda$ and $\mu$ is zero, so is the other. Next, suppose, per absurdum, that $\lambda\mu \neq 0$, and rewrite the inequality as $\lambda v^{fg} \le \mu v^{hg}$. If, say, $\lambda, \mu > 0$, then $v^{fg} \cdot I \ge 0$ necessitates $v^{hg} \cdot I \ge 0$. Hence there is no $I$ for which $f \succ_I g \succ_I h$, in contradiction to the diversity axiom. Similarly, $\lambda > 0 > \mu$ precludes $f \succ_I h \succ_I g$; $\mu > 0 > \lambda$ precludes $g \succ_I f \succ_I h$; and $\lambda, \mu < 0$ implies that for no $I \in \mathbb{Q}^M_+$ is it the case that $h \succ_I g \succ_I f$. Hence the diversity axiom holds only if $\lambda = \mu = 0$. $\square$
We now turn to the main part of the proof. Suppose that $v^{fg}$, $v^{gh}$, and $v^{hf}$ are column vectors, and consider the $|M| \times 3$ matrix $(v^{fg}, v^{gh}, v^{hf})$ as a 2-person 0-sum game. If its value is positive, then there is an $x \in \Delta(M)$ such that $v^{fg} \cdot x > 0$, $v^{gh} \cdot x > 0$, and $v^{hf} \cdot x > 0$. Hence there is an $I \in \mathbb{Q}^M_+ \cap \Delta(M)$ that satisfies the same inequalities. This, in turn, implies that $f \succ_I g$, $g \succ_I h$, and $h \succ_I f$, a contradiction. Therefore the value of the game is zero or negative. In this case there are $\lambda, \mu, \zeta \ge 0$ such that $\lambda v^{fg} + \mu v^{gh} + \zeta v^{hf} \le 0$ and $\lambda + \mu + \zeta = 1$. The claim above implies that if one of the numbers $\lambda$, $\mu$, and $\zeta$ is zero, so are the other two. Thus $\lambda, \mu, \zeta > 0$. We therefore conclude that there are $\alpha = \lambda/\zeta > 0$ and $\beta = \mu/\zeta > 0$ such that

(*) $\alpha v^{fg} + \beta v^{gh} \le v^{fh}$.

Applying the same reasoning to the triple $h$, $g$, and $f$, we conclude that there are $\gamma, \delta > 0$ such that

(**) $\gamma v^{hg} + \delta v^{gf} \le v^{hf}$.

Summation yields

(***) $(\alpha - \delta)v^{fg} + (\beta - \gamma)v^{gh} \le 0$.

Claim 8 applied to inequality (***) implies $\alpha = \delta$ and $\beta = \gamma$. Hence inequality (**) may be rewritten as $\alpha v^{fg} + \beta v^{gh} \ge v^{fh}$, which together with (*) yields the desired representation. $\square$

Lemma 2 shows that, if there are more than three alternatives, the likelihood ranking of every triple of alternatives can be represented as in the theorem. The question that remains is whether these separate representations (for different triples) can be "patched" together in a consistent way.
Lemma 3. There are vectors $\{v^{ab}\}_{a,b \in X, a \neq b}$, as in Lemma 1, such that for any three distinct alternatives $f, g, h \in X$ the Jacobi identity $v^{fg} + v^{gh} = v^{fh}$ holds.

Proof: The proof is by induction, which is transfinite if $X$ is uncountably infinite. The main idea is the following. Assume that one has rescaled the vectors $v^{ab}$ for all alternatives $a, b$ in some subset $A \subset X$, and one now wishes to add another alternative $d \notin A$ to this subset. Choose $a, b \in A$ and consider the vectors $v^{ad}, v^{bd}$. By Lemma 2, there are unique positive coefficients $\alpha, \beta$ such that $v^{ab} = \alpha v^{ad} + \beta v^{db}$. One would like to show that the coefficient $\alpha$ does not depend on the choice of $b \in A$. Indeed, if it did, one would find that there are $a, b, c \in A$ such that the vectors $v^{ad}, v^{bd}, v^{cd}$ are linearly dependent, and this contradicts the diversity axiom.

Claim 9. Let $A \subset X$, $|A| \ge 3$, $d \in X \setminus A$. Suppose that there are vectors $\{v^{ab}\}_{a,b \in A, a \neq b}$, as in Lemma 1, such that for any three distinct $f, g, h \in A$, $v^{fg} + v^{gh} = v^{fh}$ holds. Then there are vectors $\{v^{ab}\}_{a,b \in A \cup \{d\}, a \neq b}$, as in Lemma 1, such that for any three distinct $f, g, h \in A \cup \{d\}$, $v^{fg} + v^{gh} = v^{fh}$ holds.

Proof: Choose distinct $a, b, c \in A$. Let $\hat{v}^{ad}$, $\hat{v}^{bd}$, and $\hat{v}^{cd}$ be the vectors provided by Lemma 1 when applied to the pairs $(a,d)$, $(b,d)$, and $(c,d)$, respectively. Consider the triple $\{a, b, d\}$. By Lemma 2 there are unique coefficients $\lambda(\{a,d\};b), \lambda(\{b,d\};a) > 0$ such that

(I) $v^{ab} = \lambda(\{a,d\};b)\hat{v}^{ad} + \lambda(\{b,d\};a)\hat{v}^{db}$.

Applying the same reasoning to the triple $\{a, c, d\}$, we find that there are unique coefficients $\lambda(\{a,d\};c), \lambda(\{c,d\};a) > 0$ such that $v^{ac} = \lambda(\{a,d\};c)\hat{v}^{ad} + \lambda(\{c,d\};a)\hat{v}^{dc}$, or

(II) $v^{ca} = \lambda(\{a,d\};c)\hat{v}^{da} + \lambda(\{c,d\};a)\hat{v}^{cd}$.
We wish to show that $\lambda(\{a,d\};b) = \lambda(\{a,d\};c)$. To see this, we consider also the triple $\{b, c, d\}$ and conclude that there are unique coefficients $\lambda(\{b,d\};c), \lambda(\{c,d\};b) > 0$ such that

(III) $v^{bc} = \lambda(\{b,d\};c)\hat{v}^{bd} + \lambda(\{c,d\};b)\hat{v}^{dc}$.

Since $a, b, c \in A$, we have $v^{ab} + v^{bc} + v^{ca} = 0$, and it follows that the summation of the right-hand sides of (I), (II), and (III) also vanishes:

$$[\lambda(\{a,d\};b) - \lambda(\{a,d\};c)]\hat{v}^{ad} + [\lambda(\{b,d\};c) - \lambda(\{b,d\};a)]\hat{v}^{bd} + [\lambda(\{c,d\};a) - \lambda(\{c,d\};b)]\hat{v}^{cd} = 0.$$

If some of the coefficients above are not zero, the vectors $\{\hat{v}^{ad}, \hat{v}^{bd}, \hat{v}^{cd}\}$ are linearly dependent, and this contradicts the diversity axiom. For instance, if $\hat{v}^{ad}$ is a non-negative linear combination of $\hat{v}^{bd}$ and $\hat{v}^{cd}$, for no $I$ will it be the case that $b \succ_I c \succ_I d \succ_I a$. We therefore obtain $\lambda(\{a,d\};b) = \lambda(\{a,d\};c)$ for every $b, c \in A \setminus \{a\}$. Hence for every $a \in A$ there exists a unique $\lambda(\{a,d\}) > 0$ such that, for every distinct $a, b \in A$,

$$v^{ab} = \lambda(\{a,d\})\hat{v}^{ad} + \lambda(\{b,d\})\hat{v}^{db}.$$

Defining $v^{ad} = \lambda(\{a,d\})\hat{v}^{ad}$ completes the proof of the claim. $\square$

The lemma is proved by an inductive application of the claim. In case $X$ is not countable, the induction is transfinite. $\square$

Note that Lemma 3, unlike Lemma 2, guarantees the possibility of rescaling simultaneously all the $v^{ab}$'s from Lemma 1 such that the Jacobi identity holds on $X$.

We now complete the proof that (i) implies (ii). Choose an arbitrary element, say, $e \in X$. Define $v^e = 0$, and for any other alternative $a$, define $v^a = v^{ae}$, where the $v^{ae}$'s are from Lemma 3.
Given $I \in \mathbb{Q}^M_+$ and $a, b \in X$ we have:

$$a \succsim_I b \iff v^{ab} \cdot I \ge 0 \iff (v^{ae} + v^{eb}) \cdot I \ge 0 \iff (v^{ae} - v^{be}) \cdot I \ge 0 \iff v^a \cdot I - v^b \cdot I \ge 0 \iff v^a \cdot I \ge v^b \cdot I.$$

The first equivalence follows from Lemma 1(i), the second from the Jacobi identity of Lemma 3, the third from Lemma 1(vi), and the fourth from the definition of the $v^a$'s. Hence, (**) of the theorem has been proved.

It remains to be shown that the vectors defined above are such that $\mathrm{conv}(\{v^a - v^b, v^b - v^c, v^c - v^d\}) \cap \mathbb{R}^M_- = \emptyset$. Indeed, in Lemma 1(v) we have shown that $v^a - v^b \notin \mathbb{R}^M_-$. To see this one only uses the diversity axiom for the pair $\{a,b\}$. Lemma 2 has shown, among other things, that a non-zero linear combination of $v^a - v^b$ and $v^b - v^c$ cannot be in $\mathbb{R}^M_-$, using the diversity axiom for triples. Linear independence of all three vectors was established in Lemma 3. However, the full implication of the diversity condition will be clarified by the following lemma. Being a complete characterization, we will also use it in proving the converse implication, namely, that part (ii) of the theorem implies part (i). The proof of the lemma below depends on Lemma 1. It therefore holds under the assumption that for any distinct $a, b \in X$ there is an $I$ such that $a \succ_I b$.

Lemma 4. For every list $(a, b, c, d)$ of distinct elements of $X$, there exists $I \in J$ such that $a \succ_I b \succ_I c \succ_I d$ iff

$$\mathrm{conv}(\{v^{ab}, v^{bc}, v^{cd}\}) \cap \mathbb{R}^M_- = \emptyset.$$
Proof: There exists $I \in J$ such that $a \succ_I b \succ_I c \succ_I d$ iff there exists $I \in J$ such that $v^{ab} \cdot I, v^{bc} \cdot I, v^{cd} \cdot I > 0$. This is true iff there exists a probability vector $p \in \Delta(M)$ such that $v^{ab} \cdot p, v^{bc} \cdot p, v^{cd} \cdot p > 0$. Suppose that $v^{ab}$, $v^{bc}$, and $v^{cd}$ are column vectors, and consider the $|M| \times 3$ matrix $(v^{ab}, v^{bc}, v^{cd})$ as a 2-person 0-sum game. The argument above implies that there exists $I \in J$ such that $a \succ_I b \succ_I c \succ_I d$ iff the maximin in this game is positive. This is equivalent to the minimax being positive, which means that for every mixed strategy of player 2 there exists $c \in M$ that guarantees player 1 a positive payoff. In other words, there exists $I \in J$ such that $a \succ_I b \succ_I c \succ_I d$ iff every convex combination of $\{v^{ab}, v^{bc}, v^{cd}\}$ has at least one positive entry, i.e., $\mathrm{conv}(\{v^{ab}, v^{bc}, v^{cd}\}) \cap \mathbb{R}^M_- = \emptyset$. $\square$

This completes the proof that (i) implies (ii). $\square$
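Lemma 4's equivalence can be checked numerically in small examples. A minimal sketch (the three vectors are invented for illustration, and the grid search is a crude stand-in for the minimax argument):

```python
from itertools import product

# Three illustrative vectors in R^3 playing the roles of v^ab, v^bc, v^cd.
vs = [(2, -1, 0), (-1, 2, 0), (0, 0, 1)]

def dot(u, w):
    return sum(a * b for a, b in zip(u, w))

# Side 1: is there a nonnegative integer vector I with all three products positive?
exists_I = any(
    all(dot(v, I) > 0 for v in vs)
    for I in product(range(4), repeat=3)
)

# Side 2: does every convex combination (on a coarse simplex grid) have a
# positive entry, i.e., does conv({v^ab, v^bc, v^cd}) miss the nonpositive orthant?
steps = 20
misses_orthant = all(
    max(sum(w * v[k] for w, v in zip((i / steps, j / steps, (steps - i - j) / steps), vs))
        for k in range(3)) > 0
    for i in range(steps + 1) for j in range(steps + 1 - i)
)

assert exists_I and misses_orthant   # both sides of the equivalence hold here
```

For these vectors, $I = (1,1,1)$ already yields three positive products, and every convex combination has a positive coordinate, so both sides of the equivalence are satisfied at once.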
Part 2: (ii) implies (i). It is straightforward to verify that if $\{\succsim_I\}_{I \in \mathbb{Q}^M_+}$ are representable by $\{v^a\}_{a \in X}$ as in (**), they have to satisfy Axioms 1-3. To show that Axiom 4 holds, we invoke Lemma 4 of the previous part. $\square$

Part 3: Uniqueness. It is obvious that if $w^a = \alpha v^a + u$ for some scalar $\alpha > 0$, a vector $u \in \mathbb{R}^M$, and all $a \in X$, then part (ii) of the theorem holds with the matrix $w$ replacing $v$. Suppose that $\{v^a\}_{a \in X}$ and $\{w^a\}_{a \in X}$ both satisfy (**); we wish to show that there are a scalar $\alpha > 0$ and a vector $u \in \mathbb{R}^M$ such that $w^a = \alpha v^a + u$ for all $a \in X$. Recall that, for $a \neq b$, $v^a \neq \lambda v^b$ and $w^a \neq \lambda w^b$ for all $0 \neq \lambda \in \mathbb{R}$, by A4. Choose $a \neq e$ ($a, e \in X$, where $e$ satisfies $v^e = 0$). From the uniqueness part of Lemma 1, there exists a unique $\alpha > 0$ such that $(w^a - w^e) = \alpha(v^a - v^e) = \alpha v^a$. Define $u = w^e$. We now wish to show that, for any $b \in X$, $w^b = \alpha v^b + u$. It holds for $b = e$ and $b = a$, hence assume that $a \neq b \neq e$. Again, from the uniqueness part of Lemma 1, there are unique $\gamma, \delta > 0$ such that

$$(w^b - w^a) = \gamma(v^b - v^a), \qquad (w^e - w^b) = \delta(v^e - v^b).$$

Summing these two with $(w^a - w^e) = \alpha(v^a - v^e)$, we get

$$0 = \alpha(v^a - v^e) + \gamma(v^b - v^a) + \delta(v^e - v^b) = \alpha v^a + \gamma(v^b - v^a) - \delta v^b.$$
Thus $(\alpha - \gamma)v^a + (\gamma - \delta)v^b = 0$. Since $v^a \neq v^e = 0$, $v^b \neq v^e = 0$, and $v^a \neq \lambda v^b$ for $0 \neq \lambda \in \mathbb{R}$, we get $\alpha = \gamma = \delta$. Plugging $\alpha = \gamma$ into $(w^b - w^a) = \gamma(v^b - v^a)$ proves that $w^b = \alpha v^b + u$. $\square$

This completes the proof of Theorem 1. $\square$

Proof of Proposition 2 (Insufficiency of A1-3): We show that without the diversity axiom representability is not guaranteed. Let $X = \{a, b, c, d\}$ and $|M| = 3$. Define the following vectors in $\mathbb{R}^3$:

$$s^{ab} = (-1,1,0),\quad s^{ac} = (0,-1,1),\quad s^{ad} = (1,0,-1),$$
$$s^{bc} = (2,-3,1),\quad s^{cd} = (1,2,-3),\quad s^{bd} = (3,-1,-2),$$

and $s^{xy} = -s^{yx}$ and $s^{xx} = 0$ for $x, y \in X$. For $x, y \in X$ and $I \in J$ define: $x \succsim_I y$ iff $s^{xy} \cdot I \ge 0$.

It is easy to see that with this definition the axioms of continuity and combination, and the completeness part of the order axiom, hold. Only transitivity requires a proof. This can be done by direct verification. It suffices to check the four triples $(x, y, z)$ where $x, y, z \in X$ are distinct and in alphabetical order. For example, since $2s^{ab} + s^{bc} = s^{ac}$, $a \succsim_I b$ and $b \succsim_I c$ imply $a \succsim_I c$.

Suppose by way of negation that there are four vectors $v^a, v^b, v^c, v^d$ in $\mathbb{R}^3$ that represent $\succsim_I$ for all $I \in J$ as in Theorem 1. By the uniqueness of representations of half-spaces in $\mathbb{R}^3$, for every pair $x, y \in X$ there is a positive real number $\lambda^{xy}$ such that $\lambda^{xy}s^{xy} = (v^x - v^y)$. Further, $\lambda^{xy} = \lambda^{yx}$. Since $(v^a - v^b) + (v^b - v^c) + (v^c - v^a) = 0$, we have $\lambda^{ab}(-1,1,0) + \lambda^{bc}(2,-3,1) + \lambda^{ca}(0,1,-1) = 0$. So $\lambda^{bc} = \lambda^{ca}$ and $\lambda^{ab} = 2\lambda^{bc}$. Similarly, $(v^a - v^b) + (v^b - v^d) + (v^d - v^a) = 0$ implies $\lambda^{ab}(-1,1,0) + \lambda^{bd}(3,-1,-2) + \lambda^{da}(-1,0,1) = 0$, which in turn implies $\lambda^{ab} = \lambda^{bd}$ and $\lambda^{da} = 2\lambda^{bd}$. Finally, $(v^a - v^c) + (v^c - v^d) + (v^d - v^a) = 0$ implies $\lambda^{ac}(0,-1,1) + \lambda^{cd}(1,2,-3) + \lambda^{da}(-1,0,1) = 0$. Hence $\lambda^{ac} = 2\lambda^{cd}$ and $\lambda^{da} = \lambda^{cd}$.
Combining the above equalities we get ¸ac = 8¸ca , a contradiction. Obviously the diversity axiom does not hold. For explicitness, consider the order (b; c; d; a). If for some I 2 J, say I = (k; l; m), b ÂI c and c ÂI d, then 2k¡3l+m > 0 and k +2l¡3m > 0. Hence, 4k¡6l+2m+3k+6l¡9m = 7k ¡ 7m _ > 0. But d ÂI a means m ¡ k > 0, a contradiction. ¤ Proof of Proposition 5: Let there be given four events A; B; C; D in a measurable space (-; §). It su±ces to consider the minimal algebra containing these events, which is ¯nite. Assume that the atoms in this algebra are -0 = f1; : : : ; ng. It is easy to see that there exists a probability measure P on § such that P (A) > P (B) > P (C) > P (D) i® the following LP problem is feasible: (P )
  (P)  Min_{x∈R^n}  0 · x
       s.t.  (1_A − 1_B) · x ≥ 1
             (1_B − 1_C) · x ≥ 1
             (1_C − 1_D) · x ≥ 1.
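As a sanity check on the formulation of (P), one can exhibit a feasible point for a small instance. The events below are illustrative (they are not taken from the paper); any x satisfying the three constraints normalizes to a probability P with P(A) > P(B) > P(C) > P(D).

```python
# Illustrative instance: Omega_0 = {1,...,5}, A = {1,2,3}, B = {2,4},
# C = {4,5}, D = {5}. These satisfy the non-inclusion conditions (i).
OMEGA = range(1, 6)
A, B, C, D = {1, 2, 3}, {2, 4}, {4, 5}, {5}

def indicator(E):
    # 0/1 vector of the event E over the atoms of Omega_0
    return [1 if i in E else 0 for i in OMEGA]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# A feasible point found by inspection: x = (2, 2, 0, 1, 1).
x = [2, 2, 0, 1, 1]

constraints = [
    dot([a - b for a, b in zip(indicator(A), indicator(B))], x),
    dot([b - c for b, c in zip(indicator(B), indicator(C))], x),
    dot([c - d for c, d in zip(indicator(C), indicator(D))], x),
]
# Each constraint of (P) requires a value of at least 1.
print(constraints)  # [1, 1, 1]
```

Since x is also nonnegative here, dividing it by its total mass yields the desired probability measure directly.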
Problem (P) is feasible iff its dual is bounded. The dual is

  (D)  Max  α + β + γ
       s.t.  α(1_A − 1_B) + β(1_B − 1_C) + γ(1_C − 1_D) ≤ 0    (∗)
             α, β, γ ≥ 0.

Clearly, (D) is bounded iff it is bounded by zero, and this is the case iff its only feasible point is α = β = γ = 0. We now prove that conditions (i) and (ii) imply these equalities. Assume, by way of negation, that (∗) holds and let α, β, γ ≥ 0, not all three equal to zero. Observe first that if exactly one of α, β, and γ is positive, it follows that at least one of the inclusions A ⊂ B, B ⊂ C, or C ⊂ D holds, in contradiction to (i).
Next assume that exactly two of α, β, and γ are positive. Suppose first that β = 0. Condition (ii) implies that for some k, 1_A(k) + 1_C(k) > 1_B(k) + 1_D(k). Since the indicator vectors take only the values 1 or 0, there are five possibilities in which this inequality may be satisfied: 1 + 1 > 1 + 0, 0 + 1, or 0 + 0; and 1 + 0 or 0 + 1 > 0 + 0. Evaluating (∗) at k for these five possibilities leads to one of the following: α − α + γ ≤ 0; α + γ − γ ≤ 0; α + γ ≤ 0; α ≤ 0; or γ ≤ 0. All these inequalities are inconsistent with the positivity of α and γ, so this case is ruled out.

If γ = 0, consider i ∈ A∖C (whose existence follows from (i)). Now (∗) implies α ≤ 0 or α − α + β ≤ 0. The former contradicts the positivity of α, the latter that of β. The last case is α = 0. Let l ∈ B∖D (again, such a state exists by (i)). In this case (∗) implies β ≤ 0 or β − β + γ ≤ 0, a contradiction to β, γ > 0.

So we are left with the case where all three, α, β, and γ, are positive. Since the first inequality in (ii) does not hold, there exists i for which 1_A(i) + 1_B(i) > 1_C(i) + 1_D(i). We consider the five possibilities, as above: 1 + 1 > 1 + 0, 0 + 1, or 0 + 0; and 1 + 0 or 0 + 1 > 0 + 0. Substituting the corresponding values in (∗), we get that at least one of the following five inequalities holds: α − α + β − β + γ ≤ 0; α − α + β − γ ≤ 0; α − α + β ≤ 0; α ≤ 0; −α + β ≤ 0. The positivity of α, β, and γ leaves only β ≤ α or β ≤ γ. By (i), there is a state j ∈ A∖D. Again, using (∗), we get one of the following four inequalities: α ≤ 0; α − α + β ≤ 0; α − β + γ ≤ 0; and α − α + β − β + γ ≤ 0, depending on whether j belongs to B and to C or not. Three of the inequalities are directly inconsistent with the positivity of α, β, and γ. The fourth, namely α + γ ≤ β, is also inconsistent with positivity when coupled with either of β ≤ α or β ≤ γ obtained previously. This concludes the proof that (i) and (ii) suffice for the existence of a probability measure P such that P(A) > P(B) > P(C) > P(D).
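The five-pattern enumeration used in the case analysis above (here for the case β = 0) can be checked mechanically. A small sketch, with our own variable names standing for the indicator values at a state k:

```python
from itertools import product

# Enumerate all 0/1 values (a, b, c, d) of (1_A(k), 1_B(k), 1_C(k), 1_D(k))
# with 1_A(k) + 1_C(k) > 1_B(k) + 1_D(k).
patterns = [(a, b, c, d) for a, b, c, d in product((0, 1), repeat=4)
            if a + c > b + d]

# At such a state, the dual constraint (*) with beta = 0 reads
#   alpha*(a - b) + gamma*(c - d) <= 0.
# The left-hand side is positive whenever alpha, gamma > 0, because every
# pattern satisfies a >= b, c >= d, and (a - b) + (c - d) >= 1.
for a, b, c, d in patterns:
    assert a >= b and c >= d and (a - b) + (c - d) >= 1

print(len(patterns))  # 5, matching the five possibilities in the text
```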
The necessity of (i) and (ii) is obvious. □□

Proof of Theorem 6: We will construct the numerical representation by "patching" together numerical representations for subsets of events that are properly differentiated. In doing so, a few auxiliary results will be of help. We start with the following definition. Suppose that for a subset of events Δ ⊂ Σ₀ there is a matrix v^Δ : Δ × M → R. Let Δ′ be a subset of Δ. We say that v^Δ ranks Δ′ if for every non-included E, F ∈ Δ′ and every I ∈ J,

  E ≽_I F  iff  Σ_{c∈M} I(c)v^Δ(E, c) ≥ Σ_{c∈M} I(c)v^Δ(F, c).

In order to extend a numerical representation v^Δ to a larger set, we would like to know that such a representation is unique on relatively small subsets Δ′. For instance, when we consider triples of pairwise non-included events A, B, C, it would be nice to know that a function v that ranks {A, B, C} is unique as in Theorem 1. For this one would need a diversity axiom for triples of events, namely, that for any permutation thereof there exists an I ∈ J such that ≻_I agrees with the given permutation. One would expect this to follow from the seemingly more powerful diversity assumption A4*, stated for all quadruples of orderly differentiated events. However, not every triple of pairwise non-included events can be complemented to a quadruple of orderly differentiated events. Consider, for instance, n = 5, A = {1,2,3}, B = {4}, C = {5}. These are pairwise non-included, but there is no event D that is pairwise non-included with respect to all of them. This case is anomalous enough to deserve a definition: we say that three events A, B, C ∈ Σ₀ form a one-two-many partition if two of them are singletons and the third is the complement of the (union of the) first two. We now state

Lemma 1: Let (A, B, C) be a list of pairwise non-included events in Σ₀ that do not form a partition containing two singletons. Then there exists I ∈ J such that A ≻_I B ≻_I C.

Proof: First observe that, since A, B, C are pairwise non-included, there exist probability measures P on Ω such that P(A) > P(B) > P(C). We will shortly prove that there exists an event D that is non-included with respect to each of A, B, and C. We can then choose a probability P such that P(A) > P(B) > P(C) and such that P(D) differs from each of P(A), P(B), P(C). This would mean, by Proposition 3.1, that one of the four lists (D,A,B,C), (A,D,B,C), (A,B,D,C), (A,B,C,D) is orderly differentiated. We can then use the diversity axiom for that list to deduce the desired result.

We therefore wish to prove that there exists an event D that is non-included with respect to each of A, B, and C. If there exists a state i ∈ (A∪B∪C)^c, then choosing D = {i} will do. Assume, then, that A∪B∪C = Ω. Next, if A, B, and C are pairwise disjoint, then, since they do not form a one-two-many partition, at least two of them must contain more than one element. Assume, without loss of generality, that {i,j} ⊂ A and {k,l} ⊂ B. In this case D = {i,k} is non-included with respect to each of A, B, and C.

We now deal with the cases where A∪B∪C = Ω and the three events are not pairwise disjoint. Assume that one of them is disjoint from the other two, say, A ∩ (B∪C) = ∅. Let i ∈ B∖C and j ∈ C∖B. Since not all three events are disjoint, B ∩ C ≠ ∅, and it follows that D = {i,j} is non-included with respect to all three.

We therefore assume that A∪B∪C = Ω and that each event intersects the union of the other two. Observe that this implies that none of A, B, C is a singleton. Assume that one of them is not contained in the union of the other two, say, A∖(B∪C) ≠ ∅. Then there exists i ∈ A∖(B∪C). Choose j ∈ B∖A and let D = {i,j}.

Finally, we are left with the case where A∪B∪C = Ω and each event is contained in the union of the other two. Hence every state in Ω is included in at least two of A, B, C. In this case A^c = (B∩C)∖A, B^c = (A∩C)∖B, and C^c = (A∩B)∖C. Since A, B, C are in Σ₀, each of these pairwise disjoint events includes at least two elements. Construct D by selecting one element of each. □
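The existence claim just proved — a witness D whenever (A, B, C) is not a one-two-many partition — lends itself to a brute-force check on a small state space. A sketch, with our own helper names, assuming Ω = {1,…,5} and (as the surrounding lemmas suggest) that the rankable events are those E with 1 ≤ |E| ≤ |Ω| − 2:

```python
from itertools import combinations

OMEGA = frozenset(range(1, 6))

def non_included(E, F):
    # neither event contains the other
    return not (E <= F) and not (F <= E)

def one_two_many(A, B, C):
    # two singletons whose union has the third event as its complement
    for X, Y, Z in ((A, B, C), (A, C, B), (B, C, A)):
        if len(X) == 1 and len(Y) == 1 and Z == OMEGA - (X | Y):
            return True
    return False

# All candidate events of size 1, 2, or 3 (an assumption made here for
# concreteness about the domain Sigma_0).
EVENTS = [frozenset(E) for k in range(1, 4)
          for E in combinations(OMEGA, k)]

def has_witness(A, B, C):
    return any(all(non_included(D, E) for E in (A, B, C)) for D in EVENTS)

# Every pairwise non-included triple that is not a one-two-many partition
# admits a witness D ...
for A, B, C in combinations(EVENTS, 3):
    if all(non_included(X, Y) for X, Y in ((A, B), (A, C), (B, C))):
        if not one_two_many(A, B, C):
            assert has_witness(A, B, C)

# ... while the anomalous triple from the text has none.
print(has_witness(frozenset({1, 2, 3}), frozenset({4}), frozenset({5})))  # False
```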
Lemma 2: Let there be given two subsets σ, τ ⊂ Σ₀ with corresponding matrices v^σ : σ × M → R and v^τ : τ × M → R. Assume that there are three pairwise non-included events A, B, C ∈ σ ∩ τ that do not form a one-two-many partition, and that both v^σ and v^τ rank {A, B, C}. If v^σ(A, c) = v^τ(A, c) and v^σ(B, c) = v^τ(B, c) for all c ∈ M, then also v^σ(C, c) = v^τ(C, c) for all c ∈ M.

Proof: In view of the previous lemma, the triple {A, B, C} satisfies the conditions of Theorem 1. The conclusion follows from the uniqueness part of that theorem. □

Lemma 3: Let there be given a subset of events Δ ⊂ Σ₀ and a matrix v^Δ : Δ × M → R. Assume that A, B, C, D ∈ Δ are properly differentiated. If v^Δ ranks {A, C, D} and {B, C, D}, then it also ranks {A, B}.

Proof: Since A, B, C, D are properly differentiated, we can apply Theorem 1 to σ ≡ {A, B, C, D} and conclude that there exists a matrix v^σ that ranks σ. Without loss of generality we may assume that v^σ(C, c) = v^Δ(C, c) and v^σ(D, c) = v^Δ(D, c) for all c ∈ M. Since A, B, C, D are properly differentiated, we know that neither {A, C, D} nor {B, C, D} forms a one-two-many partition. Lemma 2 therefore implies that v^σ(A, c) = v^Δ(A, c) and v^σ(B, c) = v^Δ(B, c) also hold for all c ∈ M. Since v^σ ranks {A, B}, so does v^Δ. □
We now turn to the construction of a matrix v^{Σ₀} that ranks Σ₀. The strategy is as follows: we start by constructing a representation for all pairs. (Recall that Σ includes all singletons, and therefore also all pairs.) We then extend it to singletons. Next we show that it can be extended to all events in Σ₀. Define Δ₂ = {A ∈ Σ₀ : |A| = 2}.

Lemma 4: There exists v^{Δ₂} : Δ₂ × M → R that ranks Δ₂.

Proof: Choose an element of Ω and call it 1. We first consider Δ₂¹ = {{1, i} : i ≠ 1}. Any four events in Δ₂¹ are properly differentiated, and Theorem 2.1 can be applied to obtain a representation v^{Δ₂¹} : Δ₂¹ × M → R that ranks Δ₂¹.

Next we wish to extend v^{Δ₂¹} to v^{Δ₂} on all of Δ₂. Consider A = {i,j} where i, j ≠ 1. Any four events in Δ₂¹ ∪ {A} are properly differentiated. Hence Theorem 2.1 offers a unique definition of v^{Δ₂}(A, c) for c ∈ M. We claim that v^{Δ₂} ranks Δ₂. Let there be given A, B ∈ Δ₂. Notice that if at least one of them is in Δ₂¹, we already know that v^{Δ₂} ranks {A, B}. Assume, then, that A, B ∈ Δ₂∖Δ₂¹, and distinguish between two cases: (i) A ∩ B = ∅ and (ii) A ∩ B ≠ ∅. In Case (i) we have A = {i,j}, B = {k,l} where i, j, k, l are distinct and differ from 1. Observe that any four distinct events out of Δ₂¹ ∪ {A, B} are properly differentiated. It follows that there exists a matrix defined on (Δ₂¹ ∪ {A, B}) × M that ranks Δ₂¹ ∪ {A, B}, and that it coincides with our definition of v^{Δ₂}. In Case (ii) we have A = {i,j}, B = {i,k} where i, j, k are distinct and differ from 1. Observe that {{1,i}, {1,j}, {i,j}, {i,k}} are properly differentiated. By Lemma 3, since v^{Δ₂} ranks {{1,i}, {1,j}, {i,j}} and {{1,i}, {1,j}, {i,k}}, it also ranks {{i,j}, {i,k}}. □

Our next step is to extend v^{Δ₂} to singletons. Let Δ̄₂ = {A ∈ Σ₀ : |A| ≤ 2}.

Lemma 5: There exists v^{Δ̄₂} : Δ̄₂ × M → R that ranks Δ̄₂.

Proof: Let v^{Δ̄₂} equal v^{Δ₂} on all pairs. We now extend it to all singletons, and then show that this extension indeed ranks Δ̄₂. Let there be given i ∈ Ω. Choose distinct j, k, l ≠ i. There is a unique definition of v^{Δ̄₂}({i}, c) (for c ∈ M) such that v^{Δ̄₂} ranks {{i}, {j,k}, {j,l}}. We first claim that v^{Δ̄₂} thus defined ranks Δ₂ ∪ {{i}}. Indeed, for any event B ∈ Δ₂ that differs from {j,k} and {j,l}, the events {B, {j,k}, {j,l}} are pairwise non-included, and they do not form a one-two-many partition. Hence v^{Δ̄₂} ranks {B, {j,k}, {j,l}}. Further, if i ∉ B, then {B, {i}, {j,k}, {j,l}} are also properly differentiated, and, by Lemma 3, v^{Δ̄₂} also ranks {B, {i}}. Let this be the definition of v^{Δ̄₂}({i}, c) for each i ∈ Ω. We still need to show that v^{Δ̄₂} ranks {{i}, {j}} for every pair of distinct i, j ∈ Ω. Since |Ω| ≥ 5, there are two distinct C, D ∈ Δ₂ that are disjoint from {i,j}. Thus {{i}, {j}, C, D} are properly differentiated, while v^{Δ̄₂} ranks both {{i}, C, D} and {{j}, C, D}, which completes the proof. □

The following combinatorial lemma will prove useful.

Lemma 6: Let A and B be two non-included events in Σ₀. Then there are C and D in Σ₀, with |C| = |D| = 2, such that {A, B, C, D} are properly differentiated.

Proof: Assume without loss of generality that |A| ≥ |B|, and distinguish between two cases: Case 1: |A∖B| ≥ 2; Case 2: |A∖B| = |B∖A| = 1. In Case 1 pick i, j ∈ A∖B, i ≠ j, and k ∈ B∖A, and define C = {i,k} and D = {j,k}. By direct verification one can check that the conclusion of the lemma holds.

Next consider Case 2. Since |Ω∖A|, |Ω∖B| ≥ 2, there is a state p ∈ Ω∖(A∪B). Assume first that there is also a state q ≠ p such that q ∈ Ω∖(A∪B). Let i ∈ A∖B and k ∈ B∖A. If A and B are singletons, there exists j ∈ Ω∖{i,k,p,q} and we can choose C = {j,p} and D = {j,q}. Otherwise (namely, if A ∩ B ≠ ∅), defining C = {i,p} and D = {k,q} yields the desired conclusion.

We are now left with Case 2 under the additional restriction that |Ω∖(A∪B)| = 1. Since |Ω| ≥ 5, we know that |A∩B| ≥ 2. Define C = {p,k} and D = {p,l}, where k ≠ l and k, l ∈ A∩B. Once again, direct verification completes the proof. □

Completion of the proof of Theorem 6: We now proceed to define v = v^{Σ₀} : Σ₀ × M → R that ranks Σ₀, as an extension of v^{Δ̄₂}. Let there be given an event A ∈ Σ₀ with |A| > 2. Let i be an element of Ω that is not included in A, and let j, k be two elements of A. Since {A, {i,j}, {i,k}} are pairwise non-included and they do not form a one-two-many partition, Theorem 2.1 applies to them and offers a unique definition of v(A, c) (for all c ∈ M) such that v ranks {A, {i,j}, {i,k}}.

We now wish to show that v thus defined ranks {A, B} for all non-included A, B ∈ Σ₀. If |A|, |B| ≤ 2, the result follows from the definition of v as an extension of v^{Δ̄₂}. Assume, then, that |A| > 2. We split the proof into three parts according to the number of elements in B.

First assume that |B| = 2. Recall that i is an element of Ω not included in A, and that j, k are two elements of A. There are four cases to check, according to whether i ∈ B and whether A and B are disjoint. In all cases, direct verification shows that {A, B, {i,j}, {i,k}} are properly differentiated, and Lemma 3 implies that v ranks {A, B}.

Next assume that |B| = 1, i.e., that B = {l} where l ∉ A. Choose s ∉ A ∪ {l}. Since j, k are in A, {A, B, {s,j}, {s,k}} are properly differentiated, and Lemma 3 implies that v ranks {A, {l}}.

Finally, assume |B| > 2. By Lemma 6 there are C and D in Σ₀, with |C| = |D| = 2, such that {A, B, C, D} are properly differentiated. We already know that v ranks {A, C, D} and {B, C, D}, since C and D are pairs. Again, Lemma 3 implies that v ranks {A, B}.

It is easy to see that if, for some list of orderly differentiated events (A, B, C, D), the convex hull of the vectors (μ_c(A) − μ_c(B))_c, (μ_c(B) − μ_c(C))_c, and (μ_c(C) − μ_c(D))_c does intersect R^M_−, then A4* is violated, as in the proof of Theorem 1. □

The proofs of sufficiency and of uniqueness are as in the proof of Theorem 1. □□

Proof of Theorem 7: The fact that (ii) implies (i) is immediate. We will show that (i) implies (ii), and the uniqueness result, for the case of a finite algebra. Assume, then, that Ω = {1, …, n} (recall that n ≥ 5) and that Σ = 2^Ω, and let v̂ be the matrix provided by Theorem 3.2. Set w(·) = v̂({1,2}, ·) − v̂({1}, ·) − v̂({2}, ·), and define a matrix v by v(A, ·) = v̂(A, ·) + w(·), so that v({1,2}, ·) = v({1}, ·) + v({2}, ·). We wish to show that for this v, v(A, ·) = Σ_{i∈A} v({i}, ·) for every A ∈ Σ₀. Observe that if we find such a v, it is unique up to positive multiplication, since a shift by a vector w can preserve additivity only if w = 0.
Some notation will prove useful. We use event superscripts to denote rows of the matrix; thus, v^A denotes the vector v(A, ·). Also, for A, B ∈ Σ₀, define v^{A,B} = v^A − v^B. Note that for any three pairwise non-included events A, B, C we have the Jacobi identity v^{A,C} = v^{A,B} + v^{B,C}. A key observation is the following.

Lemma 1: For every three pairwise non-included events A, B, C such that (A ∪ B) ∩ C = ∅, there exists a unique λ > 0 such that v^{A∪C,B∪C} = λv^{A,B}.

Proof: By the cancellation axiom we know that the set of memories I for which A ∪ C ≽_I B ∪ C is precisely the set for which A ≽_I B. The conclusion follows from the uniqueness result of Theorem 2.1 applied to the events A, B. □

Next we show that, when we focus on a singleton C = {i}, the coefficient λ does not depend on the sets A, B.

Lemma 2: For every i ∈ Ω there exists a unique λ_i > 0 such that v^{A∪{i},B∪{i}} = λ_i v^{A,B} for every non-included A, B ∈ Σ₀ with i ∉ A ∪ B and |A|, |B| < n − 2.

Proof: Consider three events A, B, D that are pairwise non-included, none of which contains i, and none of which has more than n − 3 elements. We know that

  v^{A∪{i},B∪{i}} = v^{A∪{i},D∪{i}} + v^{D∪{i},B∪{i}}.

Applying Lemma 1 to each of the three terms above, we obtain numbers λ, μ, η > 0 such that

  v^{A∪{i},B∪{i}} = λv^{A,B},  v^{A∪{i},D∪{i}} = μv^{A,D},  v^{D∪{i},B∪{i}} = ηv^{D,B},

hence λv^{A,B} = μv^{A,D} + ηv^{D,B}, or

  v^{A,B} = (μ/λ)v^{A,D} + (η/λ)v^{D,B}.

But since we also have the Jacobi identity v^{A,B} = v^{A,D} + v^{D,B}, the restricted diversity axiom implies that λ = μ = η.

Applying this result to the case where A, B, D are singletons, we conclude that there exists λ_i > 0 such that the conclusion holds for all singletons A, B. Next, for every A ∈ Σ₀ such that i ∉ A and |A| < n − 2, and every j ∉ A ∪ {i}, we also get v^{A∪{i},{i,j}} = λ_i v^{A,{j}}. Similarly, consider such a set A, and choose j, k ∉ A ∪ {i} and l ∈ A. Defining B = {l,k} and D = {j}, we obtain v^{A∪{i},{i,k,l}} = λ_i v^{A,{k,l}}. Finally, consider two non-included events A, B ∈ Σ₀ such that i ∉ A ∪ B and |A|, |B| < n − 2. Choose l ∈ A∖B and k ∈ B∖A, define D = {l,k}, and apply the result above to these three events. The desired result follows. □

Lemma 3: There exists a unique λ > 0 such that λ = λ_i for all i ∈ Ω (where λ_i is the coefficient defined by Lemma 2). Further, this λ satisfies, for every distinct i, j, k, l ∈ Ω,

  v^{{i,j},{k,l}} = λv^{{i},{k}} + λv^{{j},{l}}.

Proof: By the Jacobi identity and Lemma 2,

  v^{{i,j},{k,l}} = v^{{i,j},{k,j}} + v^{{k,j},{k,l}} = λ_j v^{{i},{k}} + λ_k v^{{j},{l}}.

Similarly,

  v^{{i,j},{k,l}} = v^{{i,j},{l,j}} + v^{{l,j},{k,l}} = λ_j v^{{i},{l}} + λ_l v^{{j},{k}},

hence λ_j v^{{i},{k}} + λ_k v^{{j},{l}} = λ_j v^{{i},{l}} + λ_l v^{{j},{k}}, or

  λ_j v^{{i},{k}} = λ_j v^{{i},{l}} + λ_k v^{{l},{j}} + λ_l v^{{j},{k}}.

That is,

  v^{{i},{k}} = v^{{i},{l}} + (λ_k/λ_j)v^{{l},{j}} + (λ_l/λ_j)v^{{j},{k}}.

But the Jacobi identity also implies v^{{i},{k}} = v^{{i},{l}} + v^{{l},{j}} + v^{{j},{k}}, and, coupled with the diversity axiom, this means that λ_j = λ_k = λ_l. □

Until the end of the proof we reserve the symbol λ for the coefficient defined by Lemma 3.

Lemma 4: For every i ∈ Ω∖{1,2},

  v^{{1,i}} = v^{{1}} + (1 − λ)v^{{2}} + λv^{{i}}
  v^{{2,i}} = (1 − λ)v^{{1}} + v^{{2}} + λv^{{i}}.

Proof: By the symmetry between 1 and 2, it suffices to prove the first equation. Since v^{A,B} = v^A − v^B (applied to A = {1,i} and B = {1,2}), we get v^{{1,i}} = v^{{1,i},{1,2}} + v^{{1,2}}, which by Lemma 3 equals λv^{{i},{2}} + v^{{1,2}}. Using the facts v^{{1,2}} = v^{{1}} + v^{{2}} and v^{A,B} = v^A − v^B (applied to A = {i} and B = {2}), this equals

  λv^{{i}} − λv^{{2}} + v^{{1}} + v^{{2}} = v^{{1}} + (1 − λ)v^{{2}} + λv^{{i}}.
□

Lemma 5: For every distinct i, j ∈ Ω∖{1,2},

  v^{{i,j}} = (1 − λ)v^{{1}} + (1 − λ)v^{{2}} + λv^{{i}} + λv^{{j}}.

Proof: Since v^{A,B} = v^A − v^B (applied to A = {i,j} and B = {1,2}), we get v^{{i,j}} = v^{{i,j},{1,2}} + v^{{1,2}}. Using the Jacobi identity,

  v^{{i,j}} = v^{{i,j},{1,i}} + v^{{1,i},{1,2}} + v^{{1,2}}.

By Lemma 3 the right-hand side equals λv^{{j},{1}} + λv^{{i},{2}} + v^{{1,2}}. Using the facts v^{{1,2}} = v^{{1}} + v^{{2}} and v^{A,B} = v^A − v^B, we get

  v^{{i,j}} = λv^{{j}} − λv^{{1}} + λv^{{i}} − λv^{{2}} + v^{{1}} + v^{{2}} = (1 − λ)v^{{1}} + (1 − λ)v^{{2}} + λv^{{i}} + λv^{{j}}. □

Lemma 6: λ = 1.

Proof: Consider the set {3,4,5}. By the definition of v^{A,B},

  v^{{3,4,5}} = v^{{3,4,5},{2,5}} + v^{{2,5}}.

By Lemma 3 the right-hand side equals λv^{{3,4},{2}} + v^{{2,5}} = λv^{{3,4}} − λv^{{2}} + v^{{2,5}}, where the last equality follows from v^{A,B} = v^A − v^B. Using Lemma 5 for the set {3,4} and Lemma 4 for the set {2,5}, we get

  v^{{3,4,5}} = λ[(1 − λ)v^{{1}} + (1 − λ)v^{{2}} + λv^{{3}} + λv^{{4}}] − λv^{{2}} + (1 − λ)v^{{1}} + v^{{2}} + λv^{{5}}
             = (1 − λ²)v^{{1}} + (1 − λ²)v^{{2}} + λ²v^{{3}} + λ²v^{{4}} + λv^{{5}}.

By the symmetry between 4 and 5, we also get v^{{3,4,5}} = (1 − λ²)v^{{1}} + (1 − λ²)v^{{2}} + λ²v^{{3}} + λv^{{4}} + λ²v^{{5}}. Equating the two, it has to be the case that

  λv^{{4}} + v^{{5}} = v^{{4}} + λv^{{5}},  or  (1 − λ)v^{{4},{5}} = 0.

By the diversity axiom, v^{{4},{5}} ≠ 0, hence the conclusion follows. □

Lemma 7: For every i, j ∈ Ω, v^{{i,j}} = v^{{i}} + v^{{j}}.

Proof: Use Lemmata 4, 5, and 6. □

Lemma 8: For every A ∈ Σ₀, v^A = Σ_{i∈A} v^{{i}}.

Proof: By induction on |A|. We already know that the lemma holds for |A| ≤ 2. Assume it is true for |A| ≤ k and consider a set B with |B| = k + 1 < n − 1. Choose i ∈ B and j ∉ B. Since B and {i,j} are non-included, we may write

  v^B = v^{B,{i,j}} + v^{{i,j}}.

Denoting A = B∖{i}, we can write v^{B,{i,j}} = v^{A∪{i},{j}∪{i}}, where i ∉ A ∪ {j}. Using Lemmata 3 and 6, v^{B,{i,j}} = v^{A∪{i},{j}∪{i}} = v^{A,{j}}. Plugging this into the first equation, we get v^B = v^{A,{j}} + v^{{i,j}} = v^A − v^{{j}} + v^{{i,j}}. Using the induction hypothesis,

  v^B = Σ_{k∈A} v^{{k}} − v^{{j}} + v^{{i}} + v^{{j}} = Σ_{k∈B} v^{{k}}. □
This completes the proof of Theorem 7 for the finite case. □

We now turn to the case of an infinite Ω. Choose an element of Ω, say, 1. Consider all the finite sub-algebras of Σ that include {1} and have at least five atoms; for such a sub-algebra Σ′, write Σ′₀ for its set of ranked events, defined from Σ′ as Σ₀ is defined from Σ. By the finite case, for each such Σ′ there exists a matrix v^{Σ′} : Σ′₀ × M → R that ranks Σ′₀ and satisfies additivity with respect to unions of disjoint sets; further, such a matrix is unique up to multiplication by a positive scalar. Fix one such sub-algebra Σ¹ and a corresponding v^{Σ¹} for it. Let Σ² be another sub-algebra of Σ that includes {1} and has at least five atoms, and let v^{Σ²} be a matrix that ranks Σ²₀. Let Σ³ be the minimal algebra containing both Σ¹ and Σ². Applying the result to the (finite) sub-algebra Σ³, there exists v^{Σ³} that ranks Σ³₀. Since v^{Σ³} ranks both Σ¹₀ and Σ²₀, the vector (v^{Σ³}({1}, c))_{c∈M} differs from both (v^{Σ¹}({1}, c))_{c∈M} and (v^{Σ²}({1}, c))_{c∈M} by a positive multiplicative scalar. This implies that the latter two vectors also differ by a positive multiplicative scalar, and that there is a unique v^{Σ²} that ranks Σ²₀ and agrees with v^{Σ¹} on the row of {1}. (Observe that this row is not the zero vector, due to the diversity condition.)

Given an event A ∈ Σ₀, choose a finite sub-algebra Σ² as above that includes A, and define (v(A, c))_{c∈M} by the unique v^{Σ²} identified above. By similar considerations one concludes that this definition of (v(A, c))_{c∈M} does not depend on the choice of the sub-algebra that includes A. Hence v is well-defined. Finally, we wish to show that if A, B, A∪B ∈ Σ₀ and A ∩ B = ∅, then

  v(A∪B, c) = v(A, c) + v(B, c)  for all c ∈ M.
Choose a finite sub-algebra as above that includes both A and B, and observe that v coincides on its ranked events with a matrix that satisfies this equality by Lemma 8. □□

Proof of Proposition 8 (counterexample to Theorem 7 when |Ω| < 5): Consider M = Ω = {1,2,3,4} and define {≽_I}_I by the matrix v given in the table below, in which the rows are indexed by the events A — the four singletons and the six pairs of {1,2,3,4} — the columns by the cases c = 1, …, 4, and empty entries denote zeroes. It is straightforward to check that {≽_I}_I satisfies A6 for all triples of events, yet the matrix v is not additive in events.

[Table of the entries v^A(c).]
Proof of Proposition 9 (measures fail to be non-negative): Consider M = Ω = {1,2,3,4,5} and define {≽_I}_I as follows. If A ⊆ B, then A ≼_I B for every I ∈ J; if A is a proper subset of B, then A ≺_I B for every I ≠ 0. If A and B are non-included, define {≽_I}_I by the signed measures {μ_c}_{c∈M} given in the table below, in which the rows are indexed by the cases c = 1, …, 5, the columns by the states i = 1, …, 5, and empty entries denote zeroes; in particular, the first row is μ_1 = (−1, 32, 32, 32, 32).

[Table of the entries μ_c({i}).]
One may verify that {≽_I}_I are qualitative probability relations satisfying A2-A4, even though μ_1({1}) < 0.

Proof of Theorem 10: We first show

Lemma 1: The symmetry axiom implies that there are two numbers, a > b ≥ 0, such that μ_i({j}) = b if i ≠ j and μ_i({i}) = a. Observe that this condition is also equivalent to the symmetry axiom.

Proof of Lemma 1:

Claim 1: For every i ≤ n there exists a number b_i ∈ R such that μ_i({j}) = b_i for all j ≠ i.

Proof: Consider the memory I = 1_{{i}} ∈ J and a permutation π that swaps only two states j, k ≠ i. Since I = I∘π, the symmetry axiom implies that {j} ∼_I {k}, hence μ_i({j}) = μ_i({k}). □

We denote a_i = μ_i({i}).

Claim 2: For every i, j ≤ n, a_j − b_j = a_i − b_i.

Proof: Consider I = 1_{{i,j}} ∈ J and a permutation π that swaps only i and j. Since I = I∘π, the symmetry axiom implies that {i} ∼_I {j}, hence a_j + b_i = b_j + a_i, or a_j − b_j = a_i − b_i. □

Claim 3: For every distinct i, j, k ≤ n, {k} ≻_I {i,j} for I = 1_{{k}} ∈ J.

Proof: By the diversity axiom we know that there exists a memory I ∈ J such that {k} ≻_I {i,j}. The combination axiom implies that there exists s ≤ n such that {k} ≻_I {i,j} for I = 1_{{s}}. We claim that this can hold only for s = k. Indeed, first observe that if s ∉ {i,j,k}, then for I = 1_{{s}} we also get, by symmetry, {i} ≻_I {j,k}. But, since ≽_I is a qualitative probability, {k} ≻_I {i,j} implies {j,k} ≻_I {i}, a contradiction. Next assume that s = i. By similar reasoning, {k} ≻_I {i,j} implies {i,k} ≻_I {i,j}, but this implies {k} ≻_I {j}, which contradicts symmetry (as in Claim 1). The case s = j is similarly excluded, and the conclusion follows. □

Claim 4: For every i, j ≤ n, a_i = a_j > 0.

Proof: Let there be given distinct i, j, k ≤ n. Consider, for every two nonnegative integers m, l, the memory I = I(m,l) = m·1_{{j}} + l·1_{{k}} ∈ J, and the permutation π that swaps only i and j. Thus I∘π = m·1_{{i}} + l·1_{{k}}. It follows from the symmetry axiom that {i,j} ≻_I {k} iff {i,j} ≻_{I∘π} {k}. This means that, for every m, l ≥ 0,

  m·a_i + m·b_i + 2l·b_k ≥ m·b_i + l·a_k  iff  m·a_j + m·b_j + 2l·b_k ≥ m·b_j + l·a_k,

or

  m·a_i ≥ l(a_k − 2b_k)  iff  m·a_j ≥ l(a_k − 2b_k).    (∗)

Further, we argue that a_i, a_j, (a_k − 2b_k) > 0. First observe that, by Claim 3 (corresponding to m = 0, l = 1), it has to be the case that a_k − 2b_k > 0. Also, Claim 3 implies that {i} ≻_I {j,k} for I = 1_{{i}} (corresponding to m = 1, l = 0), and {i,j} ≻_I {k} follows by monotonicity of ≽_I with respect to set inclusion. Hence a_i > 0. Similarly, a_j > 0 has to hold as well. The desired result now follows from (∗). □

Combining Claims 1, 2, and 4, we conclude that there are two numbers a, b ∈ R such that μ_i({j}) = b if i ≠ j and μ_i({i}) = a. Furthermore, we know that a > 0.

Claim 5: a > b.

Proof: As in the proof of Claim 4, (a − 2b) > 0, and this suffices since a > 0. □

Claim 6: b ≥ 0.

Proof: Let i, j, k ≤ n be distinct. Consider I = 1_{{i,j}}. We know that {i} ∼_I {j}, hence {i,k} ≽_I {j}. Hence a + 3b ≥ a + b. □

This completes the proof of Lemma 1. It remains to show that the specificity axiom implies that b = 0 as well. To see this, consider again the proof of Claim 6 above, and observe that if b > 0, it has to be the case that {i,k} ≻_I {j}, where I = 1_{{i,j}}. But this contradicts the specificity axiom, because {i} ∼_I {j} and I(k) = 0. □□

Proof of Proposition 11: Assume that for some i, j ∈ Ω, μ_i({j}) < 0. This means that for I = 1_{{i}} we have ∅ ≻_I {j}, contradicting A1′. Hence the μ_i are non-negative. Next we consider i ≠ j and show that μ_i({j}) = 0. Assume, to the contrary, that μ_i({j}) > 0 for i ≠ j. In this case, for I = 1_{{i}} we have ∅ ≽_I ∅ and {j} ≻_I ∅ while I(j) = 0, contradicting A8. Also, since, by A1′, Ω ≻_I ∅ for I = 1_{{i}}, it follows that μ_i({i}) > 0 for all i ∈ Ω. Finally, consider I = 1_{{i,j}} for i ≠ j. By A7, {i} ∼_I {j}. Hence μ_i({i}) = μ_j({j}). □□
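The structure asserted by Lemma 1 of Theorem 10 is easy to verify numerically in the converse direction: if μ_i({j}) = b for j ≠ i and μ_i({i}) = a with a > b ≥ 0 and a − 2b > 0, the induced rankings behave as the claims describe. A sketch with illustrative values a = 5, b = 2 (our own choice):

```python
N = 5                 # states and cases 1..N (illustrative)
A_DIAG, B_OFF = 5, 2  # a > b >= 0 with a - 2b > 0, as in Claims 3-5

def mu(c, event):
    # mu_c(event) with mu_c({c}) = a and mu_c({j}) = b for j != c
    return sum(A_DIAG if i == c else B_OFF for i in event)

def plaus(I, event):
    # plausibility of `event` given memory I (a dict: case -> count)
    return sum(n * mu(c, event) for c, n in I.items())

# Claim 1: under I = 1_{1}, the singletons {2} and {3} tie.
assert plaus({1: 1}, {2}) == plaus({1: 1}, {3})

# Claim 3: under I = 1_{3}, {3} beats {1,2} precisely because a - 2b > 0.
assert plaus({3: 1}, {3}) > plaus({3: 1}, {1, 2})

# Claim 6: under I = 1_{1} + 1_{2}, {1,3} is weakly above {2}: a+3b vs a+b.
print(plaus({1: 1, 2: 1}, {1, 3}), plaus({1: 1, 2: 1}, {2}))  # 11 7
```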
Proof of Theorem 13: The proof follows closely that of Theorem 1. The separation argument in Lemma 1 is more direct, since in the present case the continuity axiom directly implies that the sets Y^{ba} are open (in the relative topology). This, in turn, implies that the sets W^{ab} are closed, and therefore compact in the weak* topology. This allows the use of a (weak) separating hyperplane theorem between two disjoint convex sets, one of which is compact: the convex hull of W^{ab} and the origin (i.e., {αp : α ∈ [0,1], p ∈ W^{ab}}) on the one hand, and {αp : α ∈ (0,1], p ∈ Y^{ba}} on the other. That is, we obtain a non-zero function v^{ab} ∈ B(Ω, Σ) such that ∫v^{ab} dp ≥ 0 for all p ∈ W^{ab} and ∫v^{ab} dp ≤ 0 for all p ∈ Y^{ba}. Further, we argue that v^{ab} does not vanish on W^{ab} ∪ Y^{ba} = ba₁⁺(Ω, Σ). If it did, then it would also vanish on ba⁺(Ω, Σ), and therefore also on ba⁻(Ω, Σ). But in this case it would vanish on all of ba(Ω, Σ), in view of the Jordan decomposition theorem, in contradiction to the fact that v^{ab} is non-zero.

We argue that ∫v^{ab} dp < 0 for some p ∈ Y^{ba}. If not, ∫v^{ab} dp = 0 for all p ∈ Y^{ba}. Since v^{ab} does not vanish on W^{ab} ∪ Y^{ba}, there has to exist a q ∈ W^{ab} with ∫v^{ab} dq > 0. But then ∫v^{ab} d(εq + (1−ε)p) > 0 for all ε > 0, while εq + (1−ε)p ∈ Y^{ba} for small enough ε by the continuity axiom. Next, we argue that ∫v^{ab} dq > 0 for all q ∈ Y^{ab}. Indeed, if ∫v^{ab} dq = 0 for some q ∈ Y^{ab}, then ∫v^{ab} d(εp + (1−ε)q) < 0 for all ε > 0. By a similar argument, ∫v^{ab} dp < 0 for all p ∈ Y^{ba}.

Thus Y^{ba} ⊆ {p : ∫v^{ab} dp < 0}. Since we also have W^{ab} ⊆ {p : ∫v^{ab} dp ≥ 0}, Y^{ba} ⊇ {p : ∫v^{ab} dp < 0}. That is, Y^{ba} = {p : ∫v^{ab} dp < 0} and W^{ab} = {p : ∫v^{ab} dp ≥ 0}. We have also shown that Y^{ab} ⊆ {p : ∫v^{ab} dp > 0}. To show the converse inclusion, assume that ∫v^{ab} dp > 0 but a ∼_p b. Choose q ∈ Y^{ba}. By the combination axiom, αp + (1−α)q ∈ Y^{ba} for all α ∈ (0,1). But for α close enough to 1 we have ∫v^{ab} d(αp + (1−α)q) > 0, a contradiction. Hence Y^{ab} = {p : ∫v^{ab} dp > 0} and W^{ba} = {p : ∫v^{ab} dp ≤ 0}. Observe that v^{ab} can be neither non-positive nor non-negative, due to the diversity axiom (applied to the pair a, b), as in Theorem 1.

The uniqueness of the vectors v^{ab} in Theorem 1 was proved using the existence of a strictly positive x ∈ R^M₊ for which v^{ab} · x = 0. This argument does not directly generalize to measures p ∈ P; instead, we prove uniqueness by a duality argument. Assume that v^{ab}, u^{ab} ∈ B(Ω, Σ) both satisfy conditions (i)-(v) of Lemma 1. Consider a two-person zero-sum game with payoff matrix (v^{ab}, −u^{ab}). Specifically, (i) the set of pure strategies of player 1 (the row player) is Ω; (ii) the set of pure strategies of player 2 (the column player) is {L, R}; and (iii) if player 1 chooses ω ∈ Ω, the payoff to player 1 is v^{ab}(ω) if player 2 chooses L, and −u^{ab}(ω) if player 2 chooses R. Since both v^{ab} and u^{ab} satisfy conditions (i)-(iv), there is no p ∈ P for which ∫v^{ab} dp > 0 and ∫(−u^{ab}) dp > 0. Hence the maximin in this game is non-positive, and therefore so is the minimax. It follows that there exists a mixed strategy of player 2 that guarantees a non-positive payoff against any pure strategy of player 1. In other words, there are α, β ≥ 0 with α + β = 1 such that αv^{ab}(ω) ≤ βu^{ab}(ω) for all ω ∈ Ω. Moreover, by condition (v), α, β > 0. Hence, for γ = β/α > 0, v^{ab} ≤ γu^{ab}. Applying the same argument to the game (−v^{ab}, u^{ab}), we find that there exists δ > 0 such that v^{ab} ≥ δu^{ab}. Therefore γu^{ab} ≥ v^{ab} ≥ δu^{ab} with γ, δ > 0. In view of part (v), there exists ω ∈ Ω with u^{ab}(ω) > 0, implying γ ≥ δ. By the same token there exists ω′ ∈ Ω with u^{ab}(ω′) < 0, implying γ ≤ δ. Hence γ = δ and v^{ab} = γu^{ab}.

Lemmata 2, 3, and 4, as well as the converse direction and the proof of uniqueness, follow the proof of Theorem 1. □□
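The two-column zero-sum game at the heart of the uniqueness argument can be illustrated on a finite Ω. The data below are our own (with u = 2v, mimicking the proportionality the proof derives); the minimax over player 2's mixtures is then non-positive, and the minimizing mixture (α, β) satisfies αv(ω) ≤ βu(ω) for every ω:

```python
# Column L pays v(w) to player 1, column R pays -u(w). Here u = 2v, and v
# takes both signs, as required by the diversity condition (v).
v = [1.0, -2.0, 3.0, -1.0]
u = [2.0 * x for x in v]

def worst_case(alpha):
    # payoff player 1 can still secure when player 2 mixes (alpha, 1 - alpha)
    return max(alpha * vw - (1.0 - alpha) * uw for vw, uw in zip(v, u))

# Minimize the piecewise-linear convex function over a fine grid of mixtures.
grid = [k / 3000.0 for k in range(3001)]
alpha = min(grid, key=worst_case)
beta = 1.0 - alpha

# The minimax is non-positive, hence alpha*v(w) <= beta*u(w) for all w.
assert worst_case(alpha) <= 1e-9
assert all(alpha * vw <= beta * uw + 1e-9 for vw, uw in zip(v, u))
print(round(alpha, 4))  # 0.6667, i.e. gamma = beta/alpha = 1/2 and v = (1/2)u
```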