A TURING Test Approach to Intelligent System Validation

Rainer Knauf
TU Ilmenau, Fakultät IA, Fachgebiet Künstliche Intelligenz
Postfach 10 05 65, 98684 Ilmenau
e-mail: [email protected]
Avelino J. Gonzalez
University of Central Florida, Dept. of Electrical and Computer Engineering
Orlando, FL 32816-2450, USA
e-mail: [email protected]
Abstract
The authors present some ideas on developing a validity statement based on a Turing test-like methodology with a set of test cases. The (anonymous) solutions of these test cases are rated by a panel of expert validators. The methodology is called a Turing test because a random process of distributing the test cases with solutions to the different validators ensures that no validator knows who the author of a test case solution is. Their ratings are used both for the assessment of the experts' competence and for the assessment of the system's validity. The objective of this is, of course, to make the result of the validation process (a validity statement) as objective as possible. Furthermore, in an effort to maximize objectivity, the approach includes a competence scale for each validator. The degree of competence is estimated by considering the experts' behavior while solving the test cases and rating the test case solutions.
1 Introduction

AI application fields are usually highly complex ones, where there is often no commonly accepted standard for the knowledge in these fields. That means there is no truly objective source of knowledge. Furthermore, there is usually no model which can be used to prove whether or not a given knowledge base is a correct representation of reality. If there were such a model, it would have been used already for the system's design, thus obviating the need for a knowledge-based system.¹

¹ The real objective of separating knowledge and representing it in a problem-oriented manner is to facilitate its change. And the real reason for needing to change it is that it isn't correct yet.
Nevertheless, some authors present a quasi-formal way of validation. They try to find out whether or not the specification is correct by using some general statements (domain-dependent system requirements, invariances, integrity requirements, laws of nature, etc.) about the application domain. What they do is not really validation; it is some kind of verification. They verify the specification by using a more general specification in another (formal) language, the expressivity of which is closer to the domain. A typical example of going that way is [Reif96]. We do not feel that such approaches are wrong or undesirable. On the contrary, whenever such a technique can be found, it should be utilized. But how do you verify the specification of the specification? With another (even more general) specification ...? On the other hand, we are convinced that there always remains a gap between the (non-formalized) real domain knowledge, which is usually the brain content of some expert(s), and the formalized knowledge of an AI system (respectively its specification). So the only way to achieve a nearly-objective validation technique is to use the experts' knowledge again. At first look, it appears useless to use the same source of knowledge for validation which has already been used for the system's design. However, there are some aspects which make this approach very useful nevertheless:²

² The most convincing argument is, of course, that in most cases this seems to be the only way.

1. Experts are humans. And humans are usually not able to precisely express the content of their brain. By confronting an expert with the effect of his knowledge on well selected cases, some mistakes can be uncovered. The job of judging the correctness of a given test case solution is different from the job of expressing domain knowledge. The latter is a very creative one whereas the first one is more analytical
in nature. Thus, it makes sense to validate the knowledge originating from a creative process by an analytical one.

2. The suggested methodology includes a competence scale for every expert involved in the validation process and for every test case. This ensures that a less competent expert's ratings of the system's solutions have less influence on the resulting validity statement than those of a more competent expert. Competence is, of course, a property of experts which isn't distributed homogeneously over the field of their expertise. Furthermore, not all experts are equally competent in a certain problem solving task. That's why the degree of competence is estimated for each expert and for each test case individually.

Nevertheless, to apply the following approach to validate an expert system, we suggest involving, to the maximum extent possible, experts with different views on the application field who may have contrary opinions and who were not involved in the design process of the AI system to be validated.

Buchanan and Shortliffe ([BS85]) describe a Turing test type of approach in their evaluation of MYCIN which shares some commonalities with our technique. While comprehensive in nature, they do not attempt to generalize it to serve for all knowledge-based systems. Our approach formalizes the technique as much as feasible and the result is a generic, albeit conceptual, one usable by many types of knowledge-based systems.
2 The Proposed Technique – an Overview

The suggested methodology is quite similar in concept to the Turing test: Let's have one AI system which is to be validated, n experts, a set of m "good" test cases³, two ratings {0, 1}, in which 1 means "correct" and 0 means "incorrect", and two degrees of certainty {0, 1}, in which 1 means "sure" and 0 means "unsure".

³ How to generate the set of "good" test cases is the subject of other papers published by the authors and their group.
The idea of the Turing test methodology as illustrated in figure 1 is divided into four steps:

1. solving of the test cases by the expert validation panel as well as by the expert system to be validated,
2. randomly mixing the test case solutions and removing their authorship,
3. rating all (anonymous) test case solutions, and
4. evaluating the ratings.

These steps are explained in detail in the following sections.
3 Solving Test Cases

Having m test cases, each test case has to be solved by both the n (human) experts e_1, ..., e_n, who realize the expertise E⁴, and the one expert system e_{n+1}, which realizes the system's knowledge S⁵ and which is being validated.

This leads to m·(n+1) solved test cases. Each solved test case contains

- the test case t_j (1 ≤ j ≤ m, t_j ∈ I), which is an input of both the expertise and the system,
- the test case solver e_i (1 ≤ i ≤ n+1), who is a part of the expertise (e_i ∈ E), and
- the test case solution s(e_i, t_j) = s_ij, which is an output of both the expertise and the system (s_ij ∈ O ∪ {unknown}).

Thus, a solved test case can be represented by a triple [t_j, e_i, s_ij]. The test case solution s_ij is either a final conclusion of the system or the special test case solution value unknown. The latter solution value gives the experts an opportunity to express their own competence in solving a specific test case.

⁴ E = \bigcup_{i=1}^{n} e_i, E: I → O, cf. [KJAP97]
⁵ S = e_{n+1}, S: I → O, cf. [KJAP97]
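As an illustration of the data produced in this step, here is a minimal Python sketch of the solved-test-case triples [t_j, e_i, s_ij]. The class and function names are assumptions made for this sketch only and do not come from the paper; the special solution value unknown is represented by the string "unknown".

```python
from dataclasses import dataclass
from typing import List

# A hypothetical representation of a solved test case [t_j, e_i, s_ij].
@dataclass
class SolvedCase:
    test_case: str        # t_j, an element of the input space I
    solver: str           # name of e_i (expert 1..n or the expert system e_{n+1})
    solution: str         # s_ij, an element of O or the special value "unknown"

def solve_all(test_cases: List[str], solvers: List[tuple]) -> List[SolvedCase]:
    """Produce the m * (n+1) solved test cases.

    `solvers` is a list of (name, solve_function) pairs; a solve_function
    may return None to express the special solution value "unknown".
    """
    solved = []
    for t in test_cases:
        for name, solve in solvers:
            s = solve(t)
            solved.append(SolvedCase(t, name, s if s is not None else "unknown"))
    return solved

# Tiny usage example with one test case, two human experts, and the system.
if __name__ == "__main__":
    solvers = [("expert_1", lambda t: "A"),
               ("expert_2", lambda t: None),   # expert 2 answers "unknown"
               ("system",   lambda t: "A")]
    for sc in solve_all(["t1"], solvers):
        print(sc)
```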
[Figure 1: A survey of the Turing test to estimate an AI system's validity. The diagram shows the m test cases being solved by the n human experts and the expert system, the resulting m·(n+1) solved test cases passing through the anonymizer and mixer, the (n+1)·m anonymous test case solutions being rated into n·m·(n+1) rated solutions (test case solution table and test case rating table), and the validity meter producing the validity estimation and the validity statement.]
4 Making the Solutions Anonymous

To ensure that the human experts are not aware of a solution's author (and especially of which is the system's solution and which is their own), the solved test cases are mixed and presented to the (human) experts e_1, ..., e_n in a random sequence⁶ without any information about the authorship. As a result of this procedure, each of the (human) experts e_1, ..., e_n gets m·(n+1) anonymous test case solutions [t_j, s_ij]. In addition to that, the assignment of solutions to their authors (a certain expert e_1, ..., e_n or the expert system e_{n+1}) is kept for evaluation purposes. That's why this procedure also provides the complete solved test cases [t_j, e_i, s_ij] to the validity meter.

⁶ From a practical standpoint this mixing procedure should be carried out only within a considered test case; this gives the (human) experts the opportunity to compare the different solutions for a certain test case and to rank these solutions. This might be helpful to come up with a correct rating.

5 Rating the Solutions

The job of an expert (who is a validator now) is to evaluate the n+1 solutions for each of the m test cases separately and without knowing the solvers. Among these solutions are both the system's as well as each validator's own solution. In this procedure each expert has two general ways to react to a given test case solution:

1. With the help of a rating r ∈ {0, 1} and a certainty c(r) ∈ {0, 1} an expert can express his opinion about whether the solution is valid (r = 1 for "correct" and r = 0 for "incorrect") and how certain he is about this rating (c = 1 for "sure" and c = 0 for "unsure").

2. Each expert has the opportunity to express a lack of competence in his ability to evaluate a solution. This can be expressed by the special rating norating (c(norating) = 0)⁷.

⁷ This is not because of any semantic reason. It is just useful for making the formulas in section 6 simpler.

The result of the rating procedure is a set of n·m·(n+1) rated solutions. Each rating r(e_i, t_j, s(e_k, t_j)) = r(e_i, t_j, s_kj) = r_ijk is assigned to a certain solution s_kj of a certain expert e_k (1 ≤ k ≤ n+1), a certain test case t_j (1 ≤ j ≤ m), and a certain evaluating (human) expert e_i (1 ≤ i ≤ n). Assigned to each rating r_ijk there is a certainty value c(r_ijk) = c_ijk. An entire rated test case solution can be represented by a quintuple [t_j, e_i, s_kj, r_ijk, c_ijk].
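The following short Python sketch illustrates one possible realization of the anonymizing/mixing step (mixing only within one test case, as suggested in footnote 6) and of the rated-solution records. The record types and the function name anonymize_and_mix are assumptions introduced for this sketch, not notation from the paper.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical records for a solved test case [t_j, e_i, s_ij]
# and a rated solution [t_j, e_i, s_kj, r_ijk, c_ijk].
@dataclass
class SolvedCase:
    test_case: str
    solver: str
    solution: str

@dataclass
class RatedSolution:
    test_case: str
    rater: str          # evaluating human expert e_i
    solution: str       # s_kj, presented without its author
    rating: object      # 1, 0, or the special value "norating"
    certainty: int      # 1 ("sure") or 0 ("unsure"); 0 for "norating"

def anonymize_and_mix(solved: List[SolvedCase],
                      seed: int = 0) -> Tuple[List[Tuple[str, str]], Dict]:
    """Shuffle solutions within each test case and strip their authorship.

    Returns the anonymous pairs [t_j, s_ij] handed to the validators and a
    key (kept by the validity meter) that maps each position back to its
    author for the later evaluation of the ratings.
    """
    rng = random.Random(seed)
    by_case: Dict[str, List[SolvedCase]] = {}
    for sc in solved:
        by_case.setdefault(sc.test_case, []).append(sc)

    anonymous, author_key = [], {}
    for t, cases in by_case.items():
        rng.shuffle(cases)                      # mix only within one test case
        for pos, sc in enumerate(cases):
            anonymous.append((t, sc.solution))
            author_key[(t, pos)] = sc.solver    # kept for the validity meter
    return anonymous, author_key

# Tiny usage example: two solutions for one test case get shuffled and stripped.
cases = [SolvedCase("t1", "expert_1", "A"), SolvedCase("t1", "system", "B")]
anon, key = anonymize_and_mix(cases, seed=42)
print(anon)   # authorship is kept only in `key`
```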
6 Evaluating the Ratings

This procedure is done by the validity meter, which has

1. m·(n+1) solved test cases in the test case solution table and
2. n·m·(n+1) rated test case solutions in the test case rating table.

The data in both tables is used to compute a validity statement about the expert system. Note that there is a special solution value called unknown and a special rating value called norating, which have to be evaluated in a correct manner.

6.1 Estimating an expert's competence ...

The first step towards a validity statement is to estimate the competence of each expert. We prefer to do that for each expert and for each test case separately due to the fact that not all experts are equally competent for a given test case and a certain expert's competence is not equal for each test case. The experts' competences depend on their different talents, their different educational backgrounds, and their different experiences. Experts, therefore, are likely to have different regions of competence within the application field. The best opportunity to estimate the competence of an expert e_i for a considered test case t_j is to use

1. his evaluation of his own competence,
2. the certainties c_ijk of his ratings r_ijk of other experts' e_k solutions s_kj,
3. his consistency⁸,
4. his stability⁹, and
5. the other experts' e_k (k ≠ i) ratings r_kji of his solution s_ij.

Each of these components will be graded with a number between 0 (which stands for "incompetent") and 1 (which stands for "competent").

⁸ Does he give his own solution good marks?
⁹ Is he certain while rating his own solution?

6.1.1 ... by using his own opinion about being competent

An expert's competence can be revealed in the solution s_ij in the solving procedure and in the ratings r_ijk in the rating procedure. The first component, the competence opinion while solving the test case t_j, is very simple to estimate: It is 0, iff the expert gave the "solution" unknown, and 1, iff he gave a "real" solution. The other component, the competence opinion while rating the solutions for the test case t_j (with the exception of his own solution, which is considered separately), can be estimated as the ratio between the number of real ratings (i.e., ratings other than norating) and the number of ratings altogether, which is n after excluding the rating of his own solution. We did not find any reason to let one of these components be more important than the other. That's why we let them contribute the same portion of 1/2 each. Thus, the "self-estimation" slf_est(e_i, t_j) of an expert e_i to be competent for a test case t_j is¹⁰

    slf_est(e_i, t_j) = \frac{1}{2} \cdot ord(s_{ij} \neq unknown) + \frac{1}{2n} \sum_{k=1, k \neq i}^{n+1} ord(r_{ijk} \neq norating)

¹⁰ ord(LogicExpr) = 0, iff LogicExpr = false, and ord(LogicExpr) = 1, iff LogicExpr = true.
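The following small Python function is a minimal sketch of this self-estimation, assuming the solution value unknown and the rating value norating are represented by the strings "unknown" and "norating"; the function name and argument layout are illustrative only.

```python
from typing import List

def slf_est(own_solution: str, own_ratings_of_others: List[object]) -> float:
    """Self-estimation of an expert's competence for one test case.

    own_solution           -- s_ij, possibly the special value "unknown"
    own_ratings_of_others  -- the n ratings r_ijk (k != i), each 1, 0,
                              or the special value "norating"
    """
    n = len(own_ratings_of_others)
    solved = 0 if own_solution == "unknown" else 1           # ord(s_ij != unknown)
    rated = sum(1 for r in own_ratings_of_others if r != "norating")
    return 0.5 * solved + 0.5 * (rated / n)

# Example: the expert solved the case and gave 2 real ratings out of 4,
# so slf_est = 0.5 * 1 + 0.5 * (2/4) = 0.75.
print(slf_est("A", [1, 0, "norating", "norating"]))
```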
6.1.2 ... by using his certainty while rating other experts' solutions

The job of judging whether a given solution is correct or not is totally different from the job of finding a solution. So it may happen, e.g., that an expert on the one hand isn't able to find a solution, but on the other
hand he can definitely judge the validity of a given solution. The more often an expert is certain that a given solution (of another expert) is correct or not, the more he is considered to be competent (or just self-confident). That's why this capability of an expert should be included in our competence estimation. The competence of an expert e_i based on his certainty while rating solutions for a test case t_j (the certainty-estimation crt_est(e_i, t_j)) can be estimated as the ratio between the number of certain (c_ijk = 1) ratings (with the exception of the one for his own solution¹¹) and the number of ratings altogether (which is n after excluding the one of his own solution):

    crt_est(e_i, t_j) = \frac{1}{n} \sum_{k=1, k \neq i}^{n+1} c_{ijk}

¹¹ The rating of the own solution and its certainty is considered separately below.

6.1.3 ... by using his consistency

An expert behaves consistently, if he gives his own solution¹² good marks. An expert who behaves consistently can be said to be more competent than an expert who doesn't.¹³ The consistency-based estimation cns_est(e_i, t_j) of an expert's e_i competence for a test case t_j is

    cns_est(e_i, t_j) = r_{iji}

¹² ..., without knowing that it is his own, ...
¹³ On the other hand, inconsistency can be interpreted as changing one's mind, which can be considered learning and, thus, desirable.

6.1.4 ... by using his stability

An expert behaves stably, if he gives the rating for his own solution a certainty c_iji = 1 (regardless of whether it is consistent). Of course, it may happen that he changes his opinion about the correct solution after analysing the solutions of others in the rating process, but that doesn't have any influence on his stability. Stability merely means that the expert is sure whether his own solution is correct or not in the rating process. An expert who behaves stably seems to be more competent than one who doesn't. The stability-based estimation stb_est(e_i, t_j) of an expert's e_i competence for a test case t_j is

    stb_est(e_i, t_j) = c_{iji}

6.1.5 ... by using the other experts' ratings of his solution

Another component of the competence estimation of an expert e_i for a test case t_j are the ratings r_kji of his test case solution s_ij by the other experts e_k (k ≠ i). In case an expert e_k is uncertain as to whether e_i's solution is correct or not (which means that c_kji = 0), his rating for s_ij shouldn't be included in our competence estimation. So the external estimation frgn_est(e_i, t_j) of the competence of an expert e_i by the other (external) experts e_k (k ≠ i)¹⁴ for a test case t_j can be estimated by the ratio between the number of certain (c_kji = 1) "correct" ratings (r_kji = 1) and the number of certain ratings altogether, which is the average rating of all certain ratings:

    frgn_est(e_i, t_j) = \frac{\sum_{k=1, k \neq i}^{n} c_{kji} \cdot r_{kji}}{\sum_{k=1, k \neq i}^{n} c_{kji}}

¹⁴ Of course, in some cases the majority isn't right and the expert who is rated badly can be the most creative one.

6.1.6 ... by using all the five components

We believe that the components above can be divided into three main groups: self-estimation and certainty (slf_est, crt_est), consistency and stability (cns_est, stb_est), and external estimation of competence (frgn_est). We did not find any reason to let one of these groups be more important than the others. That's why we use all three groups for estimating an expert e_i's competence cpt(e_i, t_j) for a test case t_j and let them contribute equally to the final result¹⁵, each with the same portion of 1/3. For the same reason, the sources within a group contribute with equivalent portions:

    cpt(e_i, t_j) = \frac{1}{6} slf_est(e_i, t_j) + \frac{1}{6} crt_est(e_i, t_j) + \frac{1}{6} cns_est(e_i, t_j) + \frac{1}{6} stb_est(e_i, t_j) + \frac{1}{3} frgn_est(e_i, t_j)

¹⁵ ..., which is a value for the expert's local competence, ...
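To make the five competence components and their weighting concrete, here is a minimal Python sketch of the per-test-case competence estimate cpt(e_i, t_j). The argument layout, the string encodings of norating, and the handling of an expert whose solution received no certain foreign rating (0.0 here) are assumptions of this sketch, not prescriptions of the paper.

```python
from typing import List, Tuple

def cpt(own_solution: str,
        own_ratings: List[object],       # r_ijk for k != i (ratings given by e_i)
        own_certainties: List[int],      # c_ijk for k != i
        self_rating: int,                # r_iji (rating of his own, anonymous solution)
        self_certainty: int,             # c_iji
        foreign: List[Tuple[int, int]]   # (r_kji, c_kji) from the other experts k != i
        ) -> float:
    """Per-test-case competence estimate cpt(e_i, t_j) from the five components."""
    n = len(own_ratings)                 # n ratings remain after excluding his own solution

    # 6.1.1 self-estimation: did he solve the case, and how often did he rate at all?
    solved = 0 if own_solution == "unknown" else 1
    rated = sum(1 for r in own_ratings if r != "norating")
    slf_est = 0.5 * solved + 0.5 * rated / n

    # 6.1.2 certainty while rating the other solutions
    crt_est = sum(own_certainties) / n

    # 6.1.3 consistency: does he give his own (anonymous) solution good marks?
    cns_est = self_rating

    # 6.1.4 stability: is he certain while rating his own solution?
    stb_est = self_certainty

    # 6.1.5 external estimation: average of the other experts' certain ratings
    certain = [(r, c) for r, c in foreign if c == 1]
    # The paper leaves the "no certain foreign rating" case open; 0.0 is an assumption.
    frgn_est = sum(r for r, _ in certain) / len(certain) if certain else 0.0

    # 6.1.6 equal weight for the three groups: 1/6, 1/6 | 1/6, 1/6 | 1/3
    return (slf_est + crt_est) / 6 + (cns_est + stb_est) / 6 + frgn_est / 3

# Example: n = 3 other solutions rated; the expert solved the case himself.
print(cpt("A", [1, 0, "norating"], [1, 1, 0], 1, 1, [(1, 1), (0, 0)]))   # about 0.917
```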
6.2 Estimating the (local) validity of the system for a given test case

Our next step is to come up with a statement about the (local) validity v(t_j) of the system S (which is the "expert" e_{n+1} in our scenario and in [KJAP97]) for a test case t_j. To reach this objective, the human experts' (e_1, ..., e_n) ratings r_ij(n+1) (1 ≤ i ≤ n) of the system's solution s(e_{n+1}, t_j) = s_(n+1)j should be considered. Knowing that the (human) experts e_i (1 ≤ i ≤ n) have different local competences cpt(e_i, t_j) for a considered test case t_j, their ratings for the system's solution should be weighted with these competences as coefficients. If an expert e_i is uncertain as to whether e_{n+1}'s solution is correct or not (c_ij(n+1) = 0), his rating for s_(n+1)j should not be included in our estimation of the system's local validity. Thus, we suggest a weighted average of all certain ratings:

    v(t_j) = \frac{\sum_{i=1}^{n} cpt(e_i, t_j) \cdot c_{ij(n+1)} \cdot r_{ij(n+1)}}{\sum_{i=1}^{n} cpt(e_i, t_j) \cdot c_{ij(n+1)}}

6.3 Estimating the global validity of the entire system

Now the entire expert system's validity v can be estimated by the average local validity v(t_j) over all test cases t_j¹⁶:

    v = \frac{1}{m} \sum_{j=1}^{m} v(t_j)

Of course, computing the average validity of all test cases is not sufficient in many cases. It may happen that, depending on some conclusion-related validation criteria ([AG97]), some test cases turn out to be more important for establishing the validity of the system than others. Approaches to take that fact into consideration are the subject of current research. In any case, the expert system's validity v gets a value between 0 and 1 (both inclusive). Depending on some domain- and user-related validation criteria ([AG97]), each system is associated with a minimum validity v_min, which is a threshold value for the validity statement. That is, of course, the final objective of this research, for the present: The system is called valid, iff v ≥ v_min, and invalid otherwise.

¹⁶ By the way, this formula to estimate the system's validity v_{n+1} can be used to estimate a (human) expert's "validity" v_i (1 ≤ i ≤ n) as well. Whether this really should be done is a question for psychologists ;-)
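A minimal Python sketch of these two estimation steps follows; the function names, the list-of-tuples input format, and the assumption that at least one expert gave a certain rating per test case are choices made for this sketch, not prescriptions of the paper.

```python
from typing import List, Tuple

def local_validity(per_expert: List[Tuple[float, int, int]]) -> float:
    """v(t_j): competence-weighted average of the certain ratings of the
    system's solution. `per_expert` holds one tuple
    (cpt(e_i, t_j), c_ij(n+1), r_ij(n+1)) per human expert e_i."""
    num = sum(cpt * c * r for cpt, c, r in per_expert)
    den = sum(cpt * c for cpt, c, _ in per_expert)
    return num / den                 # assumes at least one certain rating

def global_validity(local: List[float]) -> float:
    """v: average of the local validities v(t_j) over all m test cases."""
    return sum(local) / len(local)

# Example: two test cases, three experts each.
v1 = local_validity([(0.9, 1, 1), (0.5, 0, 1), (0.7, 1, 0)])   # = 0.9 / 1.6 = 0.5625
v2 = local_validity([(0.8, 1, 1), (0.6, 1, 1), (0.4, 0, 0)])   # = 1.4 / 1.4 = 1.0
v = global_validity([v1, v2])                                   # = 0.78125
print(v >= 0.75)   # compare with a hypothetical threshold v_min = 0.75
```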
7 Summary and Conclusion

The main objective of this approach is to propose a validity statement based on the results of test cases, which are solved by validating human experts as well as by the system being validated. Of course, that validity statement can't be totally objective because it is based on human judgement. Nevertheless, we presented some ideas which make the approach as objective as possible.

One way to improve this approach is to find out some tendencies of the validating experts. Humans have the property of being either more pessimistic or more optimistic in nature. We believe that a good (poor) rating by a pessimist (an optimist) should have more influence on the validity statement than a good (poor) rating by an optimist (a pessimist). Furthermore, we feel that a validity statement which is just a number isn't really useful. That's why we are currently working on a structured validity statement, which is (i) much more expressive than a "flat" number and (ii) a useful basis for system improvement.
References

[AG97] Abel, Th.; Gonzalez, A.: Utilizing Criteria to Reduce a Set of Test Cases for Expert System Validation. In: Dankel, D.D. (ed.): Proc. of the 10th Florida AI Research Symposium (FLAIRS-97), Daytona Beach, FL, USA, May 1997, pp. 402-406.

[AKG96] Abel, Th.; Knauf, R.; Gonzalez, A.: Generation of a Minimal Set of Test Cases That Is Functionally Equivalent to an Exhaustive Set, for Use in Knowledge-Based System Validation. In: Stewman, J.H. (ed.): Proc. of the 9th Florida AI Research Symposium (FLAIRS-96), Key West, FL, USA, May 1996, pp. 280-284.

[BS85] Buchanan, B.G.; Shortliffe, E.H.: Rule-Based Expert Systems - The MYCIN Experiments of the Stanford Heuristic Programming Project. Reading, MA: Addison-Wesley, 1985.

[KJAP97] Knauf, R.; Jantke, K.P.; Abel, Th.; Philippow, I.: Fundamentals of a TURING Test Approach to Validation of AI Systems. In: Proc. of the Ilmenau Int. Colloquium, 1997.

[Reif96] Reif, W.: Risikofaktor Software. In: Proc. of the 4th Leipzig Informatic Days (LIT'96), pp. 3-10, 1996.