Fundamentals of a TURING Test Approach to Validation of AI Systems

Rainer Knauf / Klaus P. Jantke / Thomas Abel / Ilka Philippow

Abstract

The purpose of the present endeavor is to set the stage for more technical research towards

1. generating "quasi-exhaustive" sets of test cases (QuEST) for the validation of intelligent systems, as described in [AKG96] and [HJK97], e.g.,

2. defining some validation criteria and using them for a reduction of QuEST down to a "reasonable" set of test cases (ReST), as described in [AG97], and

3. evaluating these test cases in a Turing test-like session, as described in [KGP97].

This paper deals with the very fundamentals of a so-called "Turing Test Methodology" for expert system validation which was recently proposed in [KGP97]. First, we survey several concepts of verification and validation. Our favoured concepts are lucidly characterized by the words that verification guarantees to build the system right, whereas validation deals with building the right system. Next, we critically inspect the thought-experiment called the Turing test. It turns out that, although this approach may not be sufficient to reveal a system's intelligence, it provides a suitable methodological background to certify a system's validity. The principles of our validation approach are surveyed.

1 Introduction

The present paper deals with the very fundamentals of a research program proposed by Rainer Knauf which aims at the development and application of a methodology to validate intelligent systems. What we propose is a Turing test-like methodology using test cases. Although Turing's approach seems not to be useful for an assessment of intelligence (cf. [Hal87]), we will use it as a corner stone of our own approach towards validating intelligent systems.

2 Validation and Verification

This section delivers a brief introduction into the underlying concepts of system validation and verification as they can be found in the topical literature. Here, we briefly refer to some basic approaches and single out our favoured one, which is used throughout the present paper as well as in our other related publications.

O'Keefe and O'Leary (cf. [OO93]) found a quite intuitive and systematic approach to characterize and to distinguish verification and validation by the two circumscriptions of building the system right and building the right system, respectively.

The first property relates a system to some specification, which provides a firm basis for the question whether or not the system at hand is right. In contrast, the latter formulation asks whether or not some system is considered right, which somehow lies in the eye of the beholder. The essential difference is illustrated in figure 1.

[Figure 1 is a diagram relating the reality / domain (behavior, characteristics, concepts, tasks) via a specification to the implemented system XPS (the model); verification relates the implementation to the specification, validation relates the system to reality.]

Figure 1: Validation and Verification

Throughout the present paper, we adopt this perspective and put the emphasis of our investigation on the validation issue. Our further work focusses on the systematic development of test sets and on the development of some methodology for system validation by systematic testing.

3 The TURING Test as a Metaphor

First, the classical approach (cf. [Tur50]), although well-known, is briefly outlined to be compared to the validation approach on which the present paper focusses. Very roughly, the Turing test means a scenario in which an artifact, usually a computer system, is checked for "intelligent behavior". In the basic setting, a human examiner is put into a dialogue situation with some remote partner, not knowing whether or not this partner is another human or just a computer system. In case the human interrogator is convinced to be in contact with another human, this is understood as a sufficient criterion for calling the system under examination "intelligent". The discussion about the nature of intelligence is postponed for a moment.

3.1 Razing Down the TURING Test

The idea seems acceptable. Faced with the absence of any sufficient consensus on the notion of intelligence, this concept is specified implicitly rather than explicitly. But a closer look exhibits that the question for the nature of intelligence is not only put off, it is just missed. Halpern's criticism (cf. [Hal87]) can be summarized as follows: The so-called Turing test does not reveal anything about the inner "intelligence" of an interrogated system; it might rather answer the question whether a computer can be programmed so as to fool human beings, with some regularity, into thinking that it is another human being. Obviously, Turing did not make an attempt towards a gradation of intelligence. Neither does the Turing test reveal a system's intelligence, nor does an ungraded approach like this seem appropriate to deal with phenomena as complex as natural or machine intelligence.

3.2 Analyzing Limits of the Approach

It seems that the question whether or not this thought-experiment may be reasonably interpreted depends heavily on certain characteristics of the interrogated computer program. For illustration, assume any conventional computer program solving routine tasks with enormous precision and within a remarkably short time. Numerical computations in arithmetics provide the simplest and quite typical instances of this application domain. Two insights are essential.

First, these are the cases in which a computer program is usually not understood to be "intelligent", although its computational power enormously exceeds the intellectual power of every human being. Second, human interrogators will normally quite easily identify the hidden partner to be a computer and not another human being.

To sum up, there is a class of computer programs implementing straightforward computations to which the Turing test approach does not apply, by nature. More explicitly, computer programs intended to perform straightforward deterministic computations would not be called intelligent if they behaved like human experts. Even worse, such a behavior would usually be a strong indication of their incorrectness. The following seems to be the essence of the comparisons under investigation: In application domains where there exists a restricted and well-defined target behavior to be implemented by some artifact (some computer system, in our case), nature rarely provides better solutions.

Beyond those deterministic problem domains, artificial intelligence research and engineering aims at solving problems by means of computer systems that are less deterministic and less well-specified or, even worse, possibly unspecifiable. In many cases, humans deal with these problems quite adequately. Moreover, some humans are even called experts. There is not much hope to find a formal characterization of the problem domains to which artificial intelligence approaches typically apply so as to separate them clearly from other domains. However, for clarity of our subsequent investigations, we circumscribe a certain type of problems in a partially formal way.

In typical application domains of artificial intelligence like classification, game playing, image understanding, planning, and speech recognition, input data frequently occur in a well-formalized form. Generally, there are several acceptable ways to react to certain input data. Given symptoms usually have multiple interpretations, in most positions on a chess board there are several reasonable moves, and so on.

Thus, such a system's correct behavior is more suitably considered a relation than a function. Humans rarely behave functionally. Hence, programs implementing functions are inappropriate subjects of the Turing test, whereas the relational case might be revisited.
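To make the relational view concrete, here is a minimal sketch in Python (our own illustration; the toy relation R and the helper is_consistent_response are assumptions, not part of the cited work) contrasting a functional behavior description with a relational one:

    # Minimal sketch: correct behavior as a relation rather than a function.
    # The toy domain and all names are illustrative assumptions.

    # Functional view: every input has exactly one correct output.
    square = {1: 1, 2: 4, 3: 9}

    # Relational view: an input may have several acceptable outputs, e.g.
    # several reasonable moves in a chess position or several plausible
    # interpretations of given symptoms.
    R = {
        "symptoms_a": {"diagnosis_1", "diagnosis_2"},
        "symptoms_b": {"diagnosis_3"},
    }

    def is_consistent_response(x, y):
        # A response y to input x is acceptable iff (x, y) lies in R.
        return y in R.get(x, set())

    assert is_consistent_response("symptoms_a", "diagnosis_2")
    assert not is_consistent_response("symptoms_b", "diagnosis_1")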

4 The TURING Test Revisited - Towards a Validation Methodology

Throughout the sequel, we assume that some desired target behavior may be understood as a relation R. With respect to some domain of admissible input data I and some space of potential outputs O, the system's behavior is constrained by R ⊆ I × O and by further problem-specific conditions which might be difficult to express formally. The validity of a system means some fitness with respect to R. The crucial difficulty is that in most interesting application domains, the target behavior R is not directly accessible. It might even change over time.

Consequently, the illustration of figure 1, which was intended to relate and to distinguish validation and verification, needs some refinement. The real situation, illustrated by figure 2, suggests that there might be transmitters between reality and the AI system under investigation. Consequently, one has to measure a system's performance against the experts' expectations. For this purpose, the Turing test approach will rise again like a phoenix from the ashes.

4.1 Basic Formalizations

In the minimal formal setting assumed so far, there are only two sets I and O. On these sets, there is somehow given a target relation R ⊆ I × O.

[Figure 2 is a refinement of figure 1: experts act as transmitters between the reality / domain (behavior, characteristics, concepts, tasks) and the system XPS (the model); their expertise E approximates the target relation R.]

Figure 2: Relating Validation and Verification Taking Experts as the Basic Knowledge Source Into Account

Under the assumption of some topology on I, one may assume that R is decomposable into finitely many functional components R_y^i, where y ranges over some finite subset of O and i ranges over some finite index set, such that

(i) R = ⋃ R_y^i,
(ii) R_y^i ⊆ I × {y} (for all i and y), and
(iii) every R_y^i is convex.

Although these conditions seem rather natural, are frequently assumed (in most classification tasks, e.g.), and allow for certain algorithmic concepts as in [AKG96], e.g., they are already more specific than required here. Therefore, we drop the requirement of convexity as well as the finiteness of the decomposition. As a minimal setting, we stick with conditions (i) and (ii) only.

An expert's knowledge should be consistent with the target phenomenon. This is an elementary requirement to characterize expertise. On the other hand, everybody who is not providing any response at all is always consistent. Thus, one needs another requirement of expertise, which is completeness. Informally, from the possibly large amount of answers to an admissible question, an expert should know at least one.

An expert's knowledge about some target phenomenon R is assumed to be a particular relation E ⊆ I × O such that the following requirements of expertise are satisfied¹:

    E ⊆ R                      [Exp1]
    π_I(E) = π_I(R)            [Exp2]

Ideally, an expert's knowledge contains exactly the target relation:

    E = R                      [Omn]

In some sense, [Exp1] is a condition of consistency, [Exp2] is a condition of completeness, and [Omn] is a property called omnipotence. An expertise E is said to be competent exactly if it is complete and consistent. Thus, a team of n experts with their domain knowledge E_1, ..., E_n is said to be

- competent exactly if it meets ⋃_{i=1..n} E_i ⊆ R and π_I(⋃_{i=1..n} E_i) = π_I(R), and

- omnipotent exactly if it meets ⋃_{i=1..n} E_i = R.

Practically, because of not having R, we estimate R by ⋃_{i=1..n} E_i. That is because we cannot speak about the "grey arrows" (cf. figure 2 above) formally.
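For finite I and O, the requirements [Exp1], [Exp2], and [Omn] can be checked directly. The following Python sketch is our own illustration under that finiteness assumption (all function names are hypothetical); relations are sets of (input, output) pairs and π_I takes first components:

    # Sketch: checking the expertise conditions on finite relations.

    def pi_I(rel):
        # Projection of a set of (input, output) pairs to the inputs.
        return {x for (x, y) in rel}

    def is_consistent(E, R):   # [Exp1]: E is a subset of R
        return E <= R

    def is_complete(E, R):     # [Exp2]: pi_I(E) = pi_I(R)
        return pi_I(E) == pi_I(R)

    def is_competent(E, R):
        return is_consistent(E, R) and is_complete(E, R)

    def is_omnipotent(E, R):   # [Omn]: E = R
        return E == R

    def team_competent(Es, R):
        # A team E_1, ..., E_n is competent iff the union of its knowledge is.
        return is_competent(set().union(*Es), R)

    # Toy target relation and two partial experts:
    R  = {("a", 1), ("a", 2), ("b", 3)}
    E1 = {("a", 1)}
    E2 = {("b", 3)}
    assert not is_competent(E1, R)      # E1 alone misses input "b"
    assert team_competent([E1, E2], R)  # together the team is competent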

4.2 Systematic System Validation

Based on the minimal formalisms provided, we are now able to develop our focused validation scenario.

- There is assumed an (implicitly given) desired target behavior R ⊆ I × O.

- There is a team of n experts which is considered to be omnipotent.

- There is some system to be validated. Its input / output relation is called S.

¹ For notational convenience, we have introduced π_I, applied to (sets of) pairs from I × O, to denote the projection to the first argument.

[Figure 3 refines figure 2 by the minimal formal setting: the domain carries the target behavior R, the experts contribute their expertise E1, E2, E3, and the system XPS (the model) realizes the input / output relation S; verification relates the implementation to the specification, validation relates S to the experts' knowledge.]

Figure 3: Relating Validation and Verification Based on a Minimal Setting of Formalized Human Expertise

Ideally, a system is omnipotent, i.e. it meets

    S = R                      [S-Omn]

Practically, a user may be satisfied with a system which is competent, i.e. which meets the requirements of consistency and completeness:

    S ⊆ R                      [S-Cons]
    π_I(S) = π_I(R)            [S-Compl]

But these three properties, relating a given system's behavior directly to the target phenomenon, are usually undecidable or, at least, unfeasible to check. Thus, as motivated above and as illustrated by our figure 2, validation deals with relating the system's behavior to the experts' knowledge. Figure 3 incorporates the minimal setting of formal concepts developed within the present paper.

The Turing test approach provides a methodology to systematically gather evidence for [S-Cons] and [S-Compl]. Intuitively, nothing better could be imagined. In general, there are no proofs of properties involving R, like [S-Cons] and [S-Compl]. The key reason is that the target phenomenon R might be far beyond formalization, whereas S is completely formal.

The research to be undertaken concentrates on the derivation of suitable interrogation strategies. Especially, one needs to know "what questions to ask" and "what to do with the answers". This is the question for suitable test sets and for an evaluation methodology of the interrogation results.
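Since R itself is not accessible, evidence for [S-Cons] and [S-Compl] can only be gathered by confronting the system with test cases and measuring its answers against the pooled expert knowledge. The sketch below is our own illustration of such a test-case-driven check (the function validate and the toy data are assumptions; pi_I is repeated from the previous sketch so the block is self-contained):

    # Sketch: gathering evidence for [S-Cons] and [S-Compl] on a finite
    # set of test inputs, with the union of the experts' knowledge used
    # as an estimate of the inaccessible target relation R.

    def pi_I(rel):
        return {x for (x, y) in rel}

    def validate(S, expert_relations, test_inputs):
        # Returns the inputs witnessing inconsistency resp. incompleteness.
        R_hat = set().union(*expert_relations)  # estimate of R
        inconsistent = {x for (x, y) in S
                        if x in test_inputs and (x, y) not in R_hat}
        incomplete = {x for x in test_inputs
                      if x in pi_I(R_hat) and x not in pi_I(S)}
        return inconsistent, incomplete

    # Toy run: the system answers input "a" wrongly and input "b" not at all.
    E1 = {("a", 1), ("b", 2)}
    S  = {("a", 9)}
    bad, missing = validate(S, [E1], {"a", "b"})
    assert bad == {"a"} and missing == {"b"}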

5 Essential Problems Beyond the Basic TURING Test Methodology

This chapter is intended to outline a few basic generalizations of the elementary approach above. Every issue should be understood as a launching pad for future work:

1. context conditions between some elements within the sets R and E_i,

2. validation in open loops of environment-machine interactions,

3. validation of integrated man-machine systems,

4. vagueness and uncertainty caused by the experts' different opinions, and

5. improvements by learning, i.e. by using the validation results to improve the system.

6 Conclusions

The Turing test turns out to be an appropriate metaphor for intelligent systems validation. System validation cannot guarantee, in general, that a system is the right one, i.e. that it is appropriate to solve all problems of a certain target class. The best system validation can certify is that a given system is at least as competent as some specified team of human experts.

Last but not least, there is obviously an urgent need to elaborate the concept of validity statements. The authors are convinced that the only way to use any Turing test result for system improvement is to produce validity statements which are somehow structured. Those statements should contain some information about the part of the knowledge which is invalid, and about the kind of invalidity. Any validity statement which is just a degree of validity on some given linear scale cannot achieve that.

Acknowledgements

The second author's work has been substantially supported by a two months visit to the Meme Media Laboratory of Hokkaido University, Sapporo, Japan, where he found excellent working conditions. Furthermore, the authors gratefully acknowledge the sponsoring by the German Research Fund (DFG) within the project on Expert System Validation under grant Ph 49/3-1. It is worth mentioning that the authors' close cooperation with US American colleagues had an enormous impact on their work. They especially acknowledge the continuous discussions with Avelino Gonzalez, University of Central Florida. His outstanding practical experience forms a permanent yardstick for the authors' efforts. Last but not least, the authors wish to express their sincere gratitude to all members of VAIS, the international and interdisciplinary Research Institute for Validation of AI Systems². Communications among the team members of the institute have always been very stimulating and influential.

² cf. the internet appearance of the institute: http://kiwi.theoinf.tu-ilmenau.de/vais/

References

[AG97] Abel, Th.; Gonzalez, A.: Utilizing Criteria to Reduce a Set of Test Cases for Expert System Validation. In: [Dan97], pp. 402-406.

[AKG96] Abel, Th.; Knauf, R.; Gonzalez, A.: Generation of a minimal set of test cases that is functionally equivalent to an exhaustive set, for use in knowledge-based system validation. In: [Stew96], pp. 280-284.

[Dan97] Dankel, D.D. (ed.): Proc. of the 10th Florida AI Research Symposium (FLAIRS-97), Daytona Beach, FL, USA, May 11-14, 1997.

[Hal87] Halpern, M.: Turing's test and the ideology of artificial intelligence. In: Artificial Intelligence Review, 1:79-93, 1987.

[HJK97] Herrmann, J.; Jantke, K.P.; Knauf, R.: Using structural knowledge for system validation. In: [Dan97], pp. 82-86.

[KGP97] Knauf, R.; Gonzalez, A.; Philippow, I.: Towards an Assessment of an AI System's Validity by a Turing Test. In: [Dan97], pp. 397-401.

[OO93] O'Keefe, R.M.; O'Leary, D.E.: Expert system verification and validation: A survey and tutorial. In: Artificial Intelligence Review, 7:3-42, 1993.

[Stew96] Stewman, J.H. (ed.): Proc. of the 9th Florida AI Research Symposium (FLAIRS-96), Key West, FL, USA, May 20-22, 1996.

[Tur50] Turing, A.M.: Computing machinery and intelligence. In: Mind, LIX(236):433-460, 1950.

Authors:

Dr.-Ing. Rainer Knauf, Dipl.-Ing. Thomas Abel, Prof. Dr.-Ing. habil. Ilka Philippow
Technical University of Ilmenau, Germany
PO Box 10 05 65, 98684 Ilmenau
Tel: ++49 3677 69 1445, Fax: ++49 3677 69 1665
eMail: [email protected]

Prof. Dr. sc. nat. Klaus P. Jantke
Forschungsinstitut für InformationsTechnologien Leipzig an der HTWK Leipzig
PO Box 3 00 66, 04251 Leipzig
Tel: ++49 172 8145354
eMail: [email protected]
