Slides for the book
Introduction to Artificial Intelligence, Wolfgang Ertel, Springer-Verlag, 2011
www.hs-weingarten.de/~ertel/kibuch   www.vieweg.de
April 10, 2017
Part I
Introduction: What is Artificial Intelligence (AI) · History of AI · Agents · Knowledge-Based Systems
What is Artificial Intelligence (AI)?

- What is intelligence?
- How can intelligence be measured?
- How does our brain work?
- Intelligent machines? Science fiction?
- Rebuild the human mind?
- Philosophy, e.g. mind-body dualism?
AI: Definition

John McCarthy (1955): The aim of AI is to develop machines that behave as if they were intelligent.
[Figure: two simple Braitenberg vehicles and their reaction to a light source.]
AI: Definition

Encyclopedia Britannica: AI is the ability of a digital computer or computer-controlled robot to perform tasks commonly associated with intelligent beings.

According to this definition, every computer is an AI system.
AI: Definition

Elaine Rich (1): Artificial Intelligence is the study of how to make computers do things at which, at the moment, people are better.

- Still up to date in the year 2050!
- Humans are still better in many fields (e.g. understanding pictures, learning)!
- Computers are already better in many fields (e.g. playing chess)!

1 E. Rich. Artificial Intelligence. McGraw-Hill, 1983.
Brain Research and Problem Solving

Different approaches:
- How does the human brain work?
- Problem oriented: building intelligent agents!
- General store!

[Figure: a small excerpt of the offered range of AI methods.]
The Turing Test and Chatterbots

Alan Turing: The machine passes the test if it can mislead Alice (the human interrogator) in 30% of the cases.

Joseph Weizenbaum (computer critic): the program Eliza talks to his secretary.
Demo: Cleverbot · Demo: Simonlaven · Demo: Alicebot
History of AI I

1931  The Austrian Kurt Gödel shows that in first-order predicate logic all true statements are derivable. In higher-order logics, on the other hand, there are true statements that are unprovable.
1937
Alan Turing points out the limits of intelligent machines with the halting problem.
1943
McCulloch and Pitts model neural networks and make the connection to propositional logic.
1950
Alan Turing defines machine intelligence with the Turing test and writes about learning machines and genetic algorithms.
1951
Marvin Minsky develops a neural network machine. With 3000 vacuum tubes he simulates 40 neurons.
History of AI II 1955
Arthur Samuel (IBM) builds a learning checkers program that plays better than its developer.
1956
McCarthy organizes a conference at Dartmouth College. Here the name Artificial Intelligence is first introduced. Newell and Simon of Carnegie Mellon University (CMU) present the Logic Theorist, the first symbol-processing computer program.
1958
McCarthy invents the high-level language LISP at MIT (Massachusetts Institute of Technology). He writes programs that are capable of modifying themselves.
1959
Gelernter (IBM) builds the Geometry Theorem Prover.
1961
The General Problem Solver (GPS) by Newell and Simon imitates human thought.
History of AI III 1963
McCarthy founds the AI Lab at Stanford University.
1965
Robinson invents the resolution calculus for predicate logic (Sec. ).
1966
Weizenbaum's program Eliza carries out dialogue with people in natural language (Sec. ).
1969
Minsky and Papert show in their book Perceptrons that the perceptron, a very simple neural network, can only represent linear functions (Sec. ).
1972
French scientist Alain Colmerauer invents the logic programming language PROLOG (Sec. ).
British physician de Dombal develops an expert system for diagnosis of acute abdominal pain. It goes unnoticed in the mainstream AI community of the time (Sec. ).
History of AI IV 1976
Shortliffe and Buchanan develop MYCIN, an expert system for diagnosis of infectious diseases, which is capable of dealing with uncertainty (Ch. ).
1981
Japan begins, at great expense, the Fifth Generation Project with the goal of building a powerful PROLOG machine.
1982
R1, the expert system for configuring computers, saves Digital Equipment Corporation 40 million dollars per year.
1986
Renaissance of neural networks through, among others, Rumelhart, Hinton and Sejnowski. The system NETtalk learns to read texts aloud (Sec. ).
1990
Pearl, Cheeseman, Whittaker, and Spiegelhalter bring probability theory into AI with Bayesian networks (Sec. ). Multi-agent systems become popular.
History of AI V 1992
Tesauro's TD-Gammon program demonstrates the advantages of reinforcement learning.
1993
Worldwide RoboCup initiative to build soccer-playing autonomous robots.
1995
From statistical learning theory, Vapnik develops support vector machines, which are very important today.
1997
First international RoboCup competition in Japan. IBM's Deep Blue beats chess world champion Garry Kasparov by a score of 3.5 to 2.5.
2003
The robots in RoboCup demonstrate impressively what AI and robotics are capable of achieving.
2006
Service robotics becomes a major AI research area.
History of AI VI 2009
First Google self-driving car on a freeway in California.
2010
Autonomous robots start learning their policies.
2011
The IBM software Watson beats two human champions on the US TV show Jeopardy! Watson understands natural language and answers very quickly (Sec. ).
2015
Daimler presents the first autonomous truck on a German highway. Google self-driving cars have logged 1.6 million kilometers on public roads and in cities. Deep learning (Sec. ) achieves very good image classification performance. With deep learning, paintings in the style of famous painters such as Picasso can be generated automatically. AI goes creative!
History of AI VII
2016
A Go program beats the European champion 5:0, based on deep learning for pattern recognition, reinforcement learning, and Monte Carlo tree search.
Phases of AI history

- The first beginnings
- Logic solves all problems
- The new connectionism
- Reasoning with uncertainty
- Distributed, autonomous and learning agents
- AI has grown up
[Figure: timeline of AI history from 1930 to 2010, ordered by power of representation. Symbolic branch: Gödel, Turing, Dartmouth conference, first-order logic, resolution, LISP, GPS, PROLOG, automated theorem provers (PTTP, Otter, SETHEO, E-prover), heuristic search, Jaynes' probabilistic reasoning, Bayesian nets, decision tree learning (Hunt; ID3, CART; C4.5), Zadeh's fuzzy logic, propositional logic, Davis/Putnam, hybrid systems. Numeric branch: neural networks, neuro-hardware, Minsky/Papert book, back-propagation, support vector machines, deep belief networks, deep learning.]
AI and Society

or: The purpose of AI
- AI saves energy
- AI saves time
- AI saves money
- AI increases productivity
- AI kills jobs!

Schwäbische Zeitung, Jan. 19, 2016
AI and Society

or: The purpose of AI
- AI saves energy
- AI saves time
- AI saves money
- AI increases productivity
- AI kills jobs!
- Economic growth creates new jobs!

The Economy must Grow!

Who earns the profits?
- Capitalists!
- Banks!
AI and Society

or: The purpose of AI
- AI saves energy
- AI saves time
- AI saves money
- AI increases productivity
- Robots do the work!
- We humans work less!
Stephen Hawking on reddit.com:

Professor Hawking, whenever I teach AI, Machine Learning, or Intelligent Robotics, my class and I end up having what I call "The Terminator Conversation". My point in this conversation is that the dangers from AI are overblown by media and non-understanding news, and the real danger is the same danger in any complex, less-than-fully-understood code: edge case unpredictability. In my opinion, this is different from dangerous AI as most people perceive it, in that the software has no motives, no sentience, and no evil morality, and is merely (ruthlessly) trying to optimize a function that we ourselves wrote and designed. Your viewpoints (and Elon Musk's) are often presented by the media as a belief in evil AI, though of course that's not what your signed letter says. Students that are aware of these reports challenge my view, and we always end up having a pretty enjoyable conversation. How would you represent your own beliefs to my class? Are our viewpoints reconcilable? Do you think my habit of discounting the layperson Terminator-style evil AI is naive? And finally, what morals do you think I should be reinforcing to my students interested in AI?
Answer: You're right: media often misrepresent what is actually said. The real risk with AI isn't malice but competence. A superintelligent AI will be extremely good at accomplishing its goals, and if those goals aren't aligned with ours, we're in trouble. You're probably not an evil ant-hater who steps on ants out of malice, but if you're in charge of a hydroelectric green energy project and there's an anthill in the region to be flooded, too bad for the ants. Let's not place humanity in the position of those ants.

Please encourage your students to think not only about how to create AI, but also about how to ensure its beneficial use.
Stephen Hawking on reddit.com:

If machines produce everything we need, the outcome will depend on how things are distributed. Everyone can enjoy a life of luxurious leisure if the machine-produced wealth is shared, or most people can end up miserably poor if the machine-owners successfully lobby against wealth redistribution. So far, the trend seems to be toward the second option, with technology driving ever-increasing inequality.
Why must the Economy grow?

- The amount of money grows too fast!
- Reason #1: money creation by private banks
- Reason #2: the interest system

The solution

- Public money (Joseph Huber: Vollgeld; www.monetative.de)
- Natural economic order (Margrit Kennedy: Occupy Money)
- Tax reform: remove the income tax, introduce an energy tax (Reiner Kümmel: The Second Law of Economics)
Agents

Software agent

[Figure: a software agent receives input from a user and returns output to the user.]

Hardware agent (autonomous robot)

[Figure: a hardware agent consists of sensors 1…n feeding perception into a software agent, which drives actuators 1…m that manipulate the environment.]
- Reflex agent: a function from the set of all inputs to the set of all outputs.
- Markov decision process: only the current state is needed to determine the optimal action.
- Agent with memory: is not a function. Why?
- Agent capable of learning
- Distributed agents
- Goal-oriented agent
Example: Spam filter

A spam filter aims at assigning emails to their correct classes.

Agent 1:
                          correct class
  spam filter decides     desired   SPAM
  desired                 189       1
  SPAM                    11        799

Agent 2:
                          correct class
  spam filter decides     desired   SPAM
  desired                 200       38
  SPAM                    0         762

Which agent is better?
Cost-Oriented Agent

Definition: The goal of a cost-oriented agent is to minimize the long-term cost (i.e. the average cost) caused by wrong decisions. The sum of all weighted errors gives the total cost.

Example: the appendicitis diagnosis system LEXMED (see Sec. ).
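To make the definition concrete, here is a minimal Python sketch that computes the weighted total cost for the two spam-filter agents above. The confusion-matrix counts come from the example; the two error weights are illustrative assumptions, not values from the slides.

# (decided, actual) -> count, taken from the spam-filter example
agent1 = {("desired", "desired"): 189, ("desired", "spam"): 1,
          ("spam", "desired"): 11, ("spam", "spam"): 799}
agent2 = {("desired", "desired"): 200, ("desired", "spam"): 38,
          ("spam", "desired"): 0, ("spam", "spam"): 762}

# Assumed weights: deleting a desired mail is far worse than
# letting one spam mail through.
cost = {("spam", "desired"): 10.0,   # desired mail classified as spam
        ("desired", "spam"): 1.0}    # spam mail classified as desired

def total_cost(agent):
    return sum(cost.get(key, 0.0) * n for key, n in agent.items())

print(total_cost(agent1))  # 10*11 + 1*1  = 111.0
print(total_cost(agent2))  # 10*0  + 1*38 =  38.0

Under these (assumed) weights, agent 2 is the better one, although it lets more spam through: it never deletes a desired email.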
Environment

- observable (chess computer)
- partially observable (robot)
- deterministic (8-puzzle)
- nondeterministic (chess computer, robot)
- discrete (chess computer)
- continuous (robot)
Knowledge-Based Systems

Strict separation of:
- Knowledge: the knowledge base (KB)
- Inference mechanism

[Figure: structure of a knowledge-based system. Knowledge sources, data, and the environment feed knowledge acquisition, either via knowledge engineering (knowledge engineer and expert) or via machine learning; the knowledge base KB together with a database is used by knowledge processing (inference), which answers the user's queries.]
Knowledge Engineering

How to get the knowledge into the AI system?
1. Knowledge engineer
2. Machine learning
Separation of knowledge and inference

Advantages:
- Inference is application-independent (e.g. medical expert system).
- Knowledge can be stored declaratively.
Knowledge Representation with formal languages

- Propositional logic
- First-order logic (short: FOL)
- Probabilistic logic
- Fuzzy logic
- Decision trees
Part II
Propositional Logic: Syntax · Semantics · Proof Systems · Resolution · Horn Clauses · Computability and Complexity
If it is raining, the street is wet.

Written more formally:

it is raining  ⇒  the street is wet
Syntax

Definition: Let Op = {¬, ∧, ∨, ⇒, ⇔, (, )} be the set of logical operators and Σ a set of symbols. The sets Op, Σ and {t, f} are pairwise disjoint. Σ is called the signature and its elements are the proposition variables. The set of propositional logic formulas is defined recursively:

- t and f are (atomic) formulas.
- All proposition variables, that is, all elements of Σ, are (atomic) formulas.
- If A and B are formulas, then ¬A, (A), A ∧ B, A ∨ B, A ⇒ B, A ⇔ B are also formulas.
Definition: We read the symbols and operators in the following way:

t      :  true
f      :  false
¬A     :  not A                 (negation)
A ∧ B  :  A and B               (conjunction)
A ∨ B  :  A or B                (disjunction)
A ⇒ B  :  if A then B           (implication)
A ⇔ B  :  A if and only if B    (equivalence)
With Σ = {A, B, C}, for example,

A ∧ B,   A ∧ B ∧ C,   A ∧ A ∧ A,   C ∧ B ∨ A,   (¬A ∧ B) ⇒ (¬C ∨ A)

are formulas. (((A)) ∨ B) is also a syntactically correct formula. The formulas defined in this way are so far purely syntactic constructions without meaning. We are still missing the semantics.
Semantics

Is the formula A ∧ B true?

Definition: A mapping I : Σ → {t, f}, which assigns a truth value to every proposition variable, is called an assignment or interpretation or also a world. Every propositional logic formula with n different variables has 2^n different interpretations.
Truth table

A B | (A)  ¬A  A∧B  A∨B  A⇒B  A⇔B
t t |  t    f   t    t    t    t
t f |  t    f   f    t    f    f
f t |  f    t   f    t    t    f
f f |  f    t   f    f    t    t

The empty formula is true for all interpretations. Operator priorities (descending binding strength): ¬, ∧, ∨, ⇒, ⇔.
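The truth table method can be sketched in a few lines of Python. This is an illustrative helper, not code from the book; a formula is passed as a Python function over an interpretation.

from itertools import product

def truth_table(variables, formula):
    # formula: a function mapping an interpretation (dict) to True/False
    for values in product([True, False], repeat=len(variables)):
        interp = dict(zip(variables, values))
        row = " ".join("t" if interp[v] else "f" for v in variables)
        print(row, "|", "t" if formula(interp) else "f")

# Example: A => B, encoded as (not A) or B
truth_table(["A", "B"], lambda i: (not i["A"]) or i["B"])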
Definition: Two formulas F and G are called semantically equivalent if they take on the same truth value for all interpretations. We write F ≡ G.

Meta language: natural language, e.g. A ≡ B. Object language: logic, e.g. A ⇔ B.
Theorem: The operations ∧, ∨ are commutative and associative, and the following equivalences are generally valid:

¬A ∨ B             ⇔  A ⇒ B              (implication)
A ⇒ B              ⇔  ¬B ⇒ ¬A            (contraposition)
(A ⇒ B) ∧ (B ⇒ A)  ⇔  (A ⇔ B)            (equivalence)
¬(A ∧ B)           ⇔  ¬A ∨ ¬B            (De Morgan's law)
¬(A ∨ B)           ⇔  ¬A ∧ ¬B            (De Morgan's law)
(A ∨ B) ∧ (A ∨ C)  ⇔  A ∨ (B ∧ C)        (distributive law)
A ∧ (B ∨ C)        ⇔  (A ∧ B) ∨ (A ∧ C)  (distributive law)
A ∨ ¬A             ⇔  t                  (tautology)
A ∧ ¬A             ⇔  f                  (contradiction)
A ∨ f              ⇔  A
A ∨ t              ⇔  t
A ∧ f              ⇔  f
A ∧ t              ⇔  A
Proof: only the first equivalence:

A B | ¬A  ¬A∨B  A⇒B  (¬A∨B) ⇔ (A⇒B)
t t |  f   t     t          t
t f |  f   f     f          t
f t |  t   t     t          t
f f |  t   t     t          t

The proofs for the other equivalences are similar and are recommended as exercises for the reader (Exercise ??).  □
Variants of truth

According to the number of interpretations in which a formula is true, we can divide formulas into the following classes:

Definition: A formula is called
- satisfiable if it is true for at least one interpretation;
- logically valid, or simply valid, if it is true for all interpretations — valid formulas are also called tautologies;
- unsatisfiable if it is not true for any interpretation.

Every interpretation that satisfies a formula is called a model of the formula.

Clearly, the negation of every valid formula is unsatisfiable. The negation of a satisfiable, but not valid, formula F is satisfiable.
Does a formula Q follow from the knowledge base WB?

Definition: A formula WB entails a formula Q (or Q follows from WB) if every model of WB is also a model of Q. We write WB |= Q.

- This is a semantic concept!
- Every formula selects a subset of the set of all interpretations as its set of models.
- Tautologies such as A ∨ ¬A do not restrict the set of satisfying interpretations, because their proposition is empty.
- The empty formula is therefore true in all interpretations.
- For every tautology T: ∅ |= T.
Theorem (deduction theorem): A |= B if and only if |= A ⇒ B.

Proof sketch: consider the truth table for implication:

A B | A ⇒ B
t t |   t
t f |   f
f t |   t
f f |   t

An arbitrary implication A ⇒ B is clearly always true except with the interpretation A ↦ t, B ↦ f. Assume that A |= B holds. This means that for every interpretation that makes A true, B is also true. The critical second row of the truth table then never applies. Therefore A ⇒ B is true under all interpretations, which means that A ⇒ B is a tautology. Thus one direction of the statement has been shown.

- The truth table method is a proof system for propositional logic!
- Disadvantage: very long computing time in the worst case (2^n interpretations).
If a formula WB entails a formula Q, then by the deduction theorem WB ⇒ Q is a tautology. Therefore the negation ¬(WB ⇒ Q) is unsatisfiable. We have

¬(WB ⇒ Q) ≡ ¬(¬WB ∨ Q) ≡ WB ∧ ¬Q.

Therefore WB ∧ ¬Q is also unsatisfiable. Thus:

Theorem (proof by contradiction): WB |= Q if and only if WB ∧ ¬Q is unsatisfiable.

To show that the query Q follows from the knowledge base WB, we can therefore add the negated query ¬Q to the knowledge base and derive a contradiction.
Fields of application:
- in mathematics
- in many automatic proof calculi, among others resolution and PROLOG

Derivation: syntactic manipulation of the formulas WB and Q by application of inference rules, with the goal of simplifying them so greatly that in the end we can instantly see that WB |= Q.

Calculus: a syntactic proof system (a system of derivation rules).
Syntactic derivation and semantic entailment

[Diagram: on the syntactic level (formulas), WB ⊢ Q via derivation; on the semantic level (interpretations), Mod(WB) |= Mod(Q) via entailment; the interpretation maps each formula to its set of models.]

To keep automatic proof systems as simple as possible, they are usually made to operate on formulas in conjunctive normal form.
Definition: A formula is in conjunctive normal form (CNF) if and only if it consists of a conjunction

K1 ∧ K2 ∧ … ∧ Km

of clauses. A clause Ki consists of a disjunction

(Li1 ∨ Li2 ∨ … ∨ Lini)

of literals. Finally, a literal is a variable (positive literal) or a negated variable (negative literal).

The formula (A ∨ B ∨ ¬C) ∧ (A ∨ B) ∧ (¬B ∨ ¬C) is in conjunctive normal form. The conjunctive normal form does not place a restriction on the set of formulas, because:

Theorem: Every propositional logic formula can be transformed into an equivalent formula in conjunctive normal form.
Example: Proof calculus modus ponens:

A,  A ⇒ B
—————————
    B

Modus ponens is sound, but not complete.

Resolution rule (special case):

A ∨ B,  ¬B ∨ C             A ∨ B,  B ⇒ C
———————————————    resp.   ———————————————        (1)
     A ∨ C                      A ∨ C

- The derived clause is called the resolvent.
- The resolution rule is a generalization of modus ponens.
General resolution rule:

(A1 ∨ … ∨ Am ∨ B),  (¬B ∨ C1 ∨ … ∨ Cn)
———————————————————————————————————————        (2)
(A1 ∨ … ∨ Am ∨ C1 ∨ … ∨ Cn)

We call the literals B and ¬B complementary.

Theorem: The resolution calculus for the proof of unsatisfiability of formulas in conjunctive normal form is sound and complete.

The knowledge base WB must be consistent! Otherwise anything can be derived from WB (see Exercise ??).
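As an illustration of the resolution calculus, here is a minimal propositional resolution prover in Python (a sketch, not the book's implementation). Clauses are sets of literals; a literal is negated with a "-" prefix. The prover saturates the clause set and reports unsatisfiability when the empty clause appears. The test clauses are those of the high-jump puzzle that appears later in this section.

from itertools import combinations

def negate(lit):
    return lit[1:] if lit.startswith("-") else "-" + lit

def resolvents(c1, c2):
    # all clauses derivable from c1, c2 by one resolution step
    out = []
    for lit in c1:
        if negate(lit) in c2:
            out.append((c1 - {lit}) | (c2 - {negate(lit)}))
    return out

def unsatisfiable(clauses):
    # saturate the clause set; True iff the empty clause is derived
    clauses = set(map(frozenset, clauses))
    while True:
        new = set()
        for c1, c2 in combinations(clauses, 2):
            for r in resolvents(c1, c2):
                if not r:
                    return True        # empty clause: contradiction found
                new.add(frozenset(r))
        if new <= clauses:
            return False               # saturated without contradiction
        clauses |= new

# The six clauses of the high-jump puzzle are unsatisfiable:
print(unsatisfiable([{"-A", "-B"}, {"A", "B"}, {"-B", "-C"},
                     {"B", "C"}, {"-C", "-A"}, {"C", "A"}]))  # True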
Example: Despite studying English for seven long years with brilliant success, I must admit that when I hear English people speaking English I'm totally perplexed. Recently, moved by noble feelings, I picked up three hitchhikers, a father, mother, and daughter, who I quickly realized were English and only spoke English. At each of the sentences that follow I wavered between two possible interpretations. They told me the following (the second possible meaning is in parentheses):

The father: We are going to Spain (we are from Newcastle).
The mother: We are not going to Spain and are from Newcastle (we stopped in Paris and are not going to Spain).
The daughter: We are not from Newcastle (we stopped in Paris).

What about this charming English family?
Three steps:
- Formalization
- Transformation into normal form
- Proof (often very difficult); practising is very important! (see Exercises ??)
(S ∨ N) ∧ [(¬S ∧ N) ∨ (P ∧ ¬S)] ∧ (¬N ∨ P)

Factoring out ¬S in the middle sub-formula brings the formula into CNF in one step:

WB ≡ (S ∨ N)1 ∧ (¬S)2 ∧ (P ∨ N)3 ∧ (¬N ∨ P)4.

Now we begin the resolution proof (at first without a query Q):

Res(1,2): (N)5
Res(3,4): (P)6
Res(1,4): (S ∨ P)7

The empty clause is not derivable, thus WB is non-contradictory.

To show that ¬S holds, we add the clause (S)8 to the set of clauses as the negated query:

Res(2,8): ()9

Thus ¬S ∧ N ∧ P holds. The charming English family evidently comes from Newcastle, stopped in Paris, but is not going to Spain.
Logic puzzle number 28 from [Ber89], The High Jump, reads:

Three girls practice high jump for their physical education final exam. The bar is set to 1.20 meters. "I bet," says the first girl to the second, "that I will make it over if, and only if, you don't." If the second girl said the same to the third, who in turn said the same to the first, would it be possible for all three to win their bets?
We show by a resolution proof that not all three can win their bets.

Formalization: The first girl's jump succeeds: A; the second girl's jump succeeds: B; the third girl's jump succeeds: C.

First girl's bet: (A ⇔ ¬B); second girl's bet: (B ⇔ ¬C); third girl's bet: (C ⇔ ¬A).

Claim: the three cannot all win their bets:

Q ≡ ¬((A ⇔ ¬B) ∧ (B ⇔ ¬C) ∧ (C ⇔ ¬A))

It must now be shown by resolution that ¬Q is unsatisfiable.

Transformation into CNF: the first girl's bet becomes

(A ⇔ ¬B) ≡ (A ⇒ ¬B) ∧ (¬B ⇒ A) ≡ (¬A ∨ ¬B) ∧ (A ∨ B).

The bets of the other two girls undergo analogous transformations, and we obtain the negated claim

¬Q ≡ (¬A ∨ ¬B)1 ∧ (A ∨ B)2 ∧ (¬B ∨ ¬C)3 ∧ (B ∨ C)4 ∧ (¬C ∨ ¬A)5 ∧ (C ∨ A)6.

From there we derive the empty clause using resolution:

Res(1,6): (C ∨ ¬B)7
Res(4,7): (C)8
Res(2,5): (B ∨ ¬C)9
Res(3,9): (¬C)10
Res(8,10): ()

Thus the claim has been proved.
Horn clauses

A clause in conjunctive normal form contains positive and negative literals and can be represented in the form

(¬A1 ∨ … ∨ ¬Am ∨ B1 ∨ … ∨ Bn).

This clause can be transformed into the equivalent form

A1 ∧ … ∧ Am ⇒ B1 ∨ … ∨ Bn.
Example: If the weather is nice and there is snow on the ground, I will go skiing or I will work. (non-definite clause)
If the weather is nice and there is snow on the ground, I will go skiing. (definite clause)
Definition: Clauses with at most one positive literal, of the form

(¬A1 ∨ … ∨ ¬Am ∨ B)   or   (¬A1 ∨ … ∨ ¬Am)   or   B,

or (equivalently)

A1 ∧ … ∧ Am ⇒ B   or   A1 ∧ … ∧ Am ⇒ f   or   B,

are named Horn clauses (after their inventor). A clause with a single positive literal is called a fact. In clauses with negative literals and one positive literal, the positive literal is called the head.

To better understand the representation of Horn clauses, the reader may derive them from the definitions of the equivalences we have been using (Exercise ??).
Example: Knowledge base:

(nice_weather)1
(snowfall)2
(snowfall ⇒ snow)3
(nice_weather ∧ snow ⇒ skiing)4

Does skiing hold? Inference rule (generalized modus ponens):

A1 ∧ … ∧ Am,  A1 ∧ … ∧ Am ⇒ B
——————————————————————————————
              B

Proof for skiing:

MP(2,3): (snow)5
MP(1,5,4): (skiing)6

Modus ponens is complete for propositional Horn clauses.

Forward chaining: starts with the facts and finally derives the query.
Backward chaining: starts with the query and works backwards until the facts are reached.
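Forward chaining for propositional Horn clauses follows directly from the generalized modus ponens. The following illustrative Python sketch (not from the slides) reproduces the ski example.

def forward_chain(facts, rules, query):
    # rules: list of (premises, conclusion); True iff query is derivable
    known = set(facts)
    changed = True
    while changed and query not in known:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and all(p in known for p in premises):
                known.add(conclusion)   # one application of modus ponens
                changed = True
    return query in known

facts = ["nice_weather", "snowfall"]
rules = [(["snowfall"], "snow"),
         (["nice_weather", "snow"], "skiing")]
print(forward_chain(facts, rules, "skiing"))  # True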
SLD resolution: Selection rule driven linear resolution for definite clauses.
Example, augmented by the negated query (skiing ⇒ f):

(nice_weather)1
(snowfall)2
(snowfall ⇒ snow)3
(nice_weather ∧ snow ⇒ skiing)4
(skiing ⇒ f)5

SLD resolution:

Res(5,4): (nice_weather ∧ snow ⇒ f)6
Res(6,1): (snow ⇒ f)7
Res(7,3): (snowfall ⇒ f)8
Res(8,2): ()
- Linear resolution: further processing is always done on the currently derived clause. This reduces the search space.
- Selection rule driven: the literals of the current clause are always processed in a fixed order (for example, from right to left).
- The literals of the current clause are called subgoals. The literals of the negated query are the goals.
Inference rule:

A1 ∧ … ∧ Am ⇒ B1,  B1 ∧ B2 ∧ … ∧ Bn ⇒ f
—————————————————————————————————————————
A1 ∧ … ∧ Am ∧ B2 ∧ … ∧ Bn ⇒ f
SLD resolution and PROLOG

- The proof (contradiction) is found if the list of subgoals of the current clause (the so-called goal stack) is empty.
- If, for a subgoal ¬Bi, there is no clause with the complementary literal Bi as its clause head, the proof terminates and no contradiction can be found.
- PROLOG programs consist of predicate logic Horn clauses.
- Their processing is achieved by means of SLD resolution.
Computability and Complexity

- The truth table method determines every model of any formula in finite time.
- The sets of unsatisfiable, satisfiable, and valid formulas are decidable.
- The computation time of the truth table method grows exponentially: T(n) = O(2^n).
- Optimization: the semantic tree; it also grows exponentially in the worst case.
- In resolution, the number of derived clauses grows exponentially with the number of initial clauses in the worst case.
- S. Cook: the 3-SAT problem is NP-complete. 3-SAT is the set of all satisfiable CNF formulas whose clauses have exactly three literals.
- For Horn clauses, the computation time for testing satisfiability grows only linearly with the number of literals in the formula.
Applications and Limitations

- Digital technology
- Verification of digital circuits
- Generation of test patterns
- Simple AI applications: discrete variables, few values, no relations between variables
- Probabilistic logic (Ch. ) uses propositional logic and models uncertainty
- Fuzzy logic allows an infinite number of truth values
Part III

First-Order Predicate Logic: Syntax · Semantics · Quantifiers and Normal Forms · Proof Calculi · Resolution · Automated Theorem Provers
Statement: robot 7 is situated at the xy position (35,79).

Propositional logic variable: Robot_7_is_situated_at_xy_position_(35,79)

100 robots on a grid of 100 × 100 points ⇒ 100 · 100 · 100 = 1 000 000 = 10^6 variables.
Relation: robot A is to the right of robot B.

(100 · 99)/2 = 4950 ordered pairs of x values. About 10^4 formulas of the type

Robot_7_is_to_the_right_of_robot_12 ⇔
  (Robot_7_is_situated_at_xy_position_(35,79) ∧ Robot_12_is_situated_at_xy_position_(10,93)) ∨ …

with (10^4)^2 · 0.495 = 0.495 · 10^8 alternatives on the right side.
First-order predicate logic: position(number, xPosition, yPosition)

∀u ∀v  is_further_right(u, v) ⇔
  ∃xu ∃yu ∃xv ∃yv  position(u, xu, yu) ∧ position(v, xv, yv) ∧ xu > xv
Syntax

Terms, e.g.: f(sin(ln(3)), exp(x))

Definition: Let V be a set of variables, K a set of constants, and F a set of function symbols. The sets V, K and F are pairwise disjoint. We define the set of terms recursively:

- All variables and constants are (atomic) terms.
- If t1, …, tn are terms and f an n-place function symbol, then f(t1, …, tn) is also a term.
Definition: Let P be a set of predicate symbols. Predicate logic formulas are built as follows:

- If t1, …, tn are terms and p an n-place predicate symbol, then p(t1, …, tn) is an (atomic) formula.
- If A and B are formulas, then ¬A, (A), A ∧ B, A ∨ B, A ⇒ B, A ⇔ B are also formulas.
- If x is a variable and A a formula, then ∀x A and ∃x A are also formulas. ∀ is the universal quantifier and ∃ the existential quantifier.
- p(t1, …, tn) and ¬p(t1, …, tn) are called literals.
- Formulas in which every variable is in the scope of a quantifier are called first-order sentences or closed formulas. Variables which are not in the scope of a quantifier are called free variables.
- Definitions ?? (CNF) and ?? (Horn clauses) hold for formulas of predicate logic literals analogously.
Examples:

Formula                                          Description
∀x frog(x) ⇒ green(x)                            All frogs are green
∀x frog(x) ∧ brown(x) ⇒ big(x)                   All brown frogs are big
∀x likes(x, cake)                                Everyone likes cake
¬∀x likes(x, cake)                               Not everyone likes cake
¬∃x likes(x, cake)                               No one likes cake
∃x ∀y likes(y, x)                                There is something that everyone likes
∃x ∀y likes(x, y)                                There is someone who likes everything
∀x ∃y likes(y, x)                                Everything is loved by someone
∀x ∃y likes(x, y)                                Everyone likes something
∀x customer(x) ⇒ likes(bob, x)                   Bob likes every customer
∃x customer(x) ∧ likes(x, bob)                   There is a customer whom Bob likes
∃x baker(x) ∧ (∀y customer(y) ⇒ likes(x, y))     There is a baker who likes all of his customers
Semantics

Definition: An assignment or interpretation B is defined as

- a mapping from the set of constants and variables K ∪ V to a set W of names of objects in the world;
- a mapping from the set of function symbols to the set of functions in the world; every n-place function symbol is assigned an n-place function;
- a mapping from the set of predicate symbols to the set of relations in the world; every n-place predicate symbol is assigned an n-place relation.
Example: Constants c1, c2, c3, two-place function symbol plus, two-place predicate symbol gr.

F ≡ gr(plus(c1, c3), c2)

Choose interpretation:

B1: c1 ↦ 1, c2 ↦ 2, c3 ↦ 3, plus ↦ +, gr ↦ >

Thus the formula is mapped to 1 + 3 > 2, resp. 4 > 2 after evaluation.

The greater-than relation G on {1, 2, 3, 4}: G = {(4,3), (4,2), (4,1), (3,2), (3,1), (2,1)}. Because (4, 2) ∈ G, F is true under the interpretation B1.

B2: c1 ↦ 2, c2 ↦ 3, c3 ↦ 1, plus ↦ −, gr ↦ >

We obtain 2 − 1 > 3, resp. 1 > 3. (1, 3) is not a member of G, so F is false under B2.

Obviously, the truth of a formula in PL1 depends on the interpretation.
Definition:

- An atomic formula p(t1, …, tn) is true under the interpretation B if, after interpretation and evaluation of all terms t1, …, tn and interpretation of the predicate p through the n-place relation r, it holds that (B(t1), …, B(tn)) ∈ r.
- The truth of quantifierless formulas follows from the truth of atomic formulas, as in propositional calculus, through the semantics of the logical operators.
- A formula ∀x F is true under the interpretation B exactly when it is true given an arbitrary change of the interpretation for the variable x (and only for x).
- A formula ∃x F is true under the interpretation B exactly when there is an interpretation for x which makes the formula true.

The definitions of semantic equivalence of formulas, of the concepts satisfiable, true, unsatisfiable, and model, along with semantic entailment (Definitions ??, ??, ??), carry over unchanged from propositional calculus to predicate logic.

Theorem: Theorems ?? (deduction theorem) and ?? (proof by contradiction) hold analogously for PL1.
Example: A family tree:

[Figure: family tree. Karen A. and Franz A. are the parents of Oscar A. and Mary B.; Anne A. and Oscar A. are the parents of Henry A., Eve A., and Isabelle A.; Mary B. and Oscar B. are the parents of Clyde B.]

Three-place relation:

Child = { (Oscar A., Karen A., Franz A.), (Mary B., Karen A., Franz A.),
          (Henry A., Anne A., Oscar A.), (Eve A., Anne A., Oscar A.),
          (Isabelle A., Anne A., Oscar A.), (Clyde B., Mary B., Oscar B.) }

One-place relation: Female = {Karen A., Anne A., Mary B., Eve A., Isabelle A.}

Predicate child(x, y, z) with the semantics

B(child(x, y, z)) = t  ≡  (B(x), B(y), B(z)) ∈ Child.

Assignment: B(oscar) = Oscar A., B(eve) = Eve A., B(anne) = Anne A.

child(eve, anne, oscar) is true! Does child(eve, oscar, anne) also hold? See also Exercise ??!

∀x ∀y ∀z  child(x, y, z) ⇔ child(x, z, y)
∀x ∀y  descendant(x, y) ⇔ ∃z child(x, y, z) ∨ (∃u ∃v child(x, u, v) ∧ descendant(u, y))

Knowledge base:

WB ≡ female(karen) ∧ female(anne) ∧ female(mary) ∧ female(eve) ∧ female(isabelle)
   ∧ child(oscar, karen, franz) ∧ child(mary, karen, franz)
   ∧ child(eve, anne, oscar) ∧ child(henry, anne, oscar) ∧ child(isabelle, anne, oscar)
   ∧ child(clyde, mary, oscarb)
   ∧ (∀x ∀y ∀z child(x, y, z) ⇒ child(x, z, y))
   ∧ (∀x ∀y descendant(x, y) ⇔ ∃z child(x, y, z) ∨ (∃u ∃v child(x, u, v) ∧ descendant(u, y))).

Is child(eve, oscar, anne) derivable? Is descendant(eve, franz) derivable?
Equality

Equality axioms:

∀x  x = x                                  (reflexivity)
∀x ∀y  x = y ⇒ y = x                       (symmetry)           (3)
∀x ∀y ∀z  x = y ∧ y = z ⇒ x = z            (transitivity)

For every function symbol f:

∀x ∀y  x = y ⇒ f(x) = f(y)                 (substitution axiom)  (4)

Replacement of terms

Example: In ∀x x = 5 ⇒ x = y, replace y by sin(x):

∀x  x = 5 ⇒ x = sin(x)     wrong!
∀x  x = 5 ⇒ x = sin(z)     correct

Definition: We write ϕ[x/t] for the formula that results when we replace every free occurrence of the variable x in ϕ with the term t. Thereby we do not allow any variables in the term t that are quantified in ϕ. In those cases variables must be renamed to ensure this.

Example: If, in the formula ∀x x = y, the free variable y is naively replaced by the term x + 1, the result is ∀x x = x + 1. With correct substitution (renaming the bound variable first) we obtain, e.g., ∀u u = x + 1, which has a quite different semantics.
Quantifiers and Normal Forms

Definition ??:

∀x p(x) ≡ p(a1) ∧ … ∧ p(an)
∃x p(x) ≡ p(a1) ∨ … ∨ p(an)

for all constants a1, …, an in K.

De Morgan's law for quantifiers:

∀x ϕ ≡ ¬∃x ¬ϕ.

Example: "Everyone wants to be loved" ≡ "Nobody does not want to be loved".
Definition: A predicate logic formula ϕ is in prenex normal form if it holds that

- ϕ = Q1 x1 … Qn xn ψ,
- ψ is a quantifierless formula, and
- Qi ∈ {∀, ∃} for i = 1, …, n.
Caution: rename variables first:

∀x p(x) ⇒ ∃x q(x)   becomes   ∀x p(x) ⇒ ∃y q(y).

Bring the quantifiers to the front:

∀x ∃y  p(x) ⇒ q(y).

How is this done with the following formula?

(∀x p(x)) ⇒ ∃y q(y)     (5)
Example: The convergence of a series (a_n)_{n∈N} to a limit a is defined by

∀ε > 0 ∃n0 ∈ N ∀n > n0  |a_n − a| < ε.

Formalized with abs(x) for |x|, a(n) for a_n, minus(x, y) for x − y, el(x, y) for x ∈ y, and gr(x, y) for x > y:

∀ε (gr(ε, 0) ⇒ ∃n0 (el(n0, N) ⇒ ∀n (gr(n, n0) ⇒ gr(ε, abs(minus(a(n), a)))))).

Eliminating implications:

∀ε (¬gr(ε, 0) ∨ ∃n0 (¬el(n0, N) ∨ ∀n (¬gr(n, n0) ∨ gr(ε, abs(minus(a(n), a)))))).

Quantifiers to the front:

∀ε ∃n0 ∀n (¬gr(ε, 0) ∨ ¬el(n0, N) ∨ ¬gr(n, n0) ∨ gr(ε, abs(minus(a(n), a))))     (6)
Theorem Every predicate logic formula can be transformed into an equivalent formula in prenex normal form.
Skolemization

∀x1 ∀x2 ∃y1 ∀x3 ∃y2  p(f(x1), x2, y1) ∨ q(y1, x3, y2)
→ ∀x1 ∀x2 ∀x3 ∃y2  p(f(x1), x2, g(x1, x2)) ∨ q(g(x1, x2), x3, y2)
→ ∀x1 ∀x2 ∀x3  p(f(x1), x2, g(x1, x2)) ∨ q(g(x1, x2), x3, h(x1, x2, x3))
→ p(f(x1), x2, g(x1, x2)) ∨ q(g(x1, x2), x3, h(x1, x2, x3)).

Applied to Equation (6):

¬gr(ε, 0) ∨ ¬el(n0(ε), N) ∨ ¬gr(n, n0(ε)) ∨ gr(ε, abs(minus(a(n), a))).

By dropping the variable n0, the Skolem function can receive the name n0.
Skolemization:

∀x1 … ∀xn ∃y ϕ  is replaced by  ∀x1 … ∀xn ϕ[y/f(x1, …, xn)],

where f may not already appear in ϕ. In ∃y p(y), y must be replaced by a (Skolem) constant.

- The resulting formula is no longer equivalent to the original formula.
- However, satisfiability remains unchanged.
Program scheme: NormalFormTransformation(Formula)

1. Transformation into prenex normal form, with transformation into conjunctive normal form (Theorem ??):
   - elimination of equivalences,
   - elimination of implications,
   - repeated application of De Morgan's law and the distributive law,
   - renaming of variables if necessary,
   - factoring out universal quantifiers.
2. Skolemization:
   - replacement of existentially quantified variables by new Skolem functions,
   - deletion of the resulting universal quantifiers.

Runtime: step 1: exponential (naive) or polynomial [Ede91]; step 2: polynomial.
The universal logic machine
Natural reasoning

- Gentzen calculus
- Sequent calculus
- Meant to be applied by humans
- Intuitive inference rules
- Works on arbitrary PL1 formulas
Example: Two simple inference rules:

A,  A ⇒ B
—————————   (modus ponens, MP)
    B

 ∀x A
———————   (∀-elimination, ∀E; the variable x is replaced by a ground term t)
A[x/t]

Proof for child(eve, oscar, anne):

WB: 1  child(eve, anne, oscar)
WB: 2  ∀x ∀y ∀z child(x, y, z) ⇒ child(x, z, y)
∀E(2), x/eve, y/anne, z/oscar: 3  child(eve, anne, oscar) ⇒ child(eve, oscar, anne)
MP(1, 3): 4  child(eve, oscar, anne)

This calculus (MP + ∀E) is not complete!
Theorem (Gödel's completeness theorem): First-order predicate logic is complete. That is, there is a calculus with which every proposition ϕ that is a consequence of a knowledge base WB can be proved: if WB |= ϕ, then it holds that WB ⊢ ϕ.

- Every true proposition in first-order predicate logic is provable.
- Is the reverse also true? Is everything we can derive syntactically actually true?

Theorem (correctness): There are calculi with which only true propositions can be proved. That is, if WB ⊢ ϕ holds, then WB |= ϕ.

- Provability and semantic consequence are equivalent concepts, as long as the calculus is correct and complete.
- Calculi of natural deduction are rather unsuited for automation.
Resolution

Resolution proof for Example ??. Query: Q ≡ child(eve, oscar, anne); negated query: ¬Q ≡ ¬child(eve, oscar, anne).

Knowledge base plus negated query in conjunctive normal form:

WB ∧ ¬Q ≡ (child(eve, anne, oscar))1
        ∧ (¬child(x, y, z) ∨ child(x, z, y))2
        ∧ (¬child(eve, oscar, anne))3.

Proof:

(2) with x/eve, y/anne, z/oscar: (¬child(eve, anne, oscar) ∨ child(eve, oscar, anne))4
Res(3, 4): (¬child(eve, anne, oscar))5
Res(1, 5): ()6
Example:

- Everybody knows his own mother.
- Does Henry know anyone?

(knows(x, mother(x)))1 ∧ (¬knows(henry, y))2

Unification with x/henry, y/mother(henry) yields

(knows(henry, mother(henry)))1 ∧ (¬knows(henry, mother(henry)))2.
Definition: Two literals are called unifiable if there is a substitution σ for all variables which makes the literals equal. Such a σ is called a unifier. A unifier is called the most general unifier (MGU) if all other unifiers can be obtained from it by substitution of variables.
Example: We want to unify the literals p(f(g(x)), y, z) and p(u, u, f(u)). Several unifiers are:

σ1:            y/f(g(x)),        z/f(f(g(x))),        u/f(g(x))
σ2: x/h(v),    y/f(g(h(v))),     z/f(f(g(h(v)))),     u/f(g(h(v)))
σ3: x/h(h(v)), y/f(g(h(h(v)))),  z/f(f(g(h(h(v))))),  u/f(g(h(h(v))))
σ4: x/h(a),    y/f(g(h(a))),     z/f(f(g(h(a)))),     u/f(g(h(a)))
σ5: x/a,       y/f(g(a)),        z/f(f(g(a))),        u/f(g(a))

where σ1 is the most general unifier. The other unifiers result from σ1 through the substitutions x/h(v), x/h(h(v)), x/h(a), x/a.
- Predicate symbols can be treated like function symbols, i.e. the literal is treated like a term.
- Arguments of functions are processed sequentially.
- Terms are unified recursively over the term structure.
Complexity

- The simplest unification algorithms are very fast in most cases.
- Worst case: the computation time grows exponentially with the size of the terms.
- Since in practice nearly all unification attempts fail, the worst-case complexity rarely has a dramatic effect.
- The fastest unification algorithms have nearly linear complexity [Bib92].
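The recursive unification over the term structure can be sketched as follows. The term representation (tuples for compound terms, strings for variables) is an assumption for illustration; the occurs check is omitted for brevity.

def is_var(t):
    return isinstance(t, str)

def substitute(t, sigma):
    # apply the substitution sigma to the term t
    if is_var(t):
        return substitute(sigma[t], sigma) if t in sigma else t
    return (t[0],) + tuple(substitute(a, sigma) for a in t[1:])

def unify(s, t, sigma=None):
    # return a most general unifier extending sigma, or None
    sigma = {} if sigma is None else sigma
    s, t = substitute(s, sigma), substitute(t, sigma)
    if s == t:
        return sigma
    if is_var(s):
        sigma[s] = t              # occurs check omitted for brevity
        return sigma
    if is_var(t):
        sigma[t] = s
        return sigma
    if s[0] != t[0] or len(s) != len(t):
        return None               # clash of functors or arities
    for a, b in zip(s[1:], t[1:]):  # unify the arguments recursively
        sigma = unify(a, b, sigma)
        if sigma is None:
            return None
    return sigma

# p(f(g(x)), y, z) and p(u, u, f(u)) from the example above:
lhs = ("p", ("f", ("g", "x")), "y", "z")
rhs = ("p", "u", "u", ("f", "u"))
print(unify(lhs, rhs))  # sigma1: u/f(g(x)), y/f(g(x)), z/f(f(g(x)))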
General resolution rule for predicate logic

Definition: The resolution rule for two clauses in conjunctive normal form reads

(A1 ∨ … ∨ Am ∨ B),  (¬B′ ∨ C1 ∨ … ∨ Cn),  σ(B) = σ(B′)
——————————————————————————————————————————————————————     (7)
(σ(A1) ∨ … ∨ σ(Am) ∨ σ(C1) ∨ … ∨ σ(Cn))

where σ is the MGU of B and B′.
Theorem: The resolution rule is correct. That is, the resolvent is a semantic consequence of the two parent clauses.
Example (Russell's paradox): There is a barber who shaves everyone who does not shave himself. Formalized in PL1:

∀x  shaves(barber, x) ⇔ ¬shaves(x, x)

Clause form (see Exercise ??):

(¬shaves(barber, x) ∨ ¬shaves(x, x))1 ∧ (shaves(barber, x) ∨ shaves(x, x))2     (8)

- No contradiction is derivable by resolution alone!
- Thus: resolution (on its own) is not complete!
Definition: Factorization of a clause is accomplished by

(A1 ∨ A2 ∨ … ∨ An),  σ(A1) = σ(A2)
———————————————————————————————————
(σ(A2) ∨ … ∨ σ(An))

where σ is the MGU of A1 and A2. Now we can derive a contradiction from Equation (8):

Fak(1, σ: x/barber): (¬shaves(barber, barber))3
Fak(2, σ: x/barber): (shaves(barber, barber))4
Res(3, 4): ()5

and we assert:
Theorem: The resolution rule (??) together with the factorization rule (??) is refutation complete. That is, by application of factorization and resolution steps, the empty clause can be derived from any unsatisfiable formula in conjunctive normal form.
Combinatorial Explosion of the Search Space ⇒ Resolution Strategies

Search space reduction by certain strategies:

Unit resolution
- Prioritizes resolution steps in which one of the two clauses consists of only one literal, called a unit clause.
- Complete.
- A heuristic (search space reduction is not guaranteed).

Set of support strategy
- Set of support (SOS): a subset of WB ∧ ¬Q.
- Resolution only between clauses from the SOS and its complement; the resolvent is added to the SOS.
- Search space reduction guaranteed.
- Not complete in general; complete if WB ∧ ¬Q \ SOS is satisfiable.
- Often the negated query ¬Q is used as the initial SOS.
Input resolution
- A clause from the input set WB ∧ ¬Q must be involved in every resolution step.
- Search space reduction guaranteed.
- Not complete.

Pure literal rule
- All clauses that contain literals for which there are no complementary literals in other clauses are deleted.
- Search space reduction guaranteed.
- Complete.
- Used by practically all resolution provers.
Subsumption
- If the literals of a clause K1 are a subset of the literals of a clause K2, then K2 can be deleted. For example, the clause (raining(today) ⇒ street_wet(today)) is redundant if street_wet(today) is already valid.
- Search space reduction guaranteed.
- Complete.
- Used by practically all resolution provers.
Equality

Equality axioms:

∀x  x = x
∀x ∀y  x = y ⇒ y = x
∀x ∀y ∀z  x = y ∧ y = z ⇒ x = z

Solution: special inference rules for equality.

Demodulation: an equation t1 = t2 is applied by means of unification to a term t as follows:

t1 = t2,  (… t …),  σ(t1) = σ(t)
—————————————————————————————————
(… σ(t2) …)

Paramodulation works with conditional equations [Bib92; BB92].

An equation t1 = t2 can be applied in both directions: substitution of t1 by t2, or of t2 by t1. Solution:
- Equations are often used in one direction only ⇒ directed equations
- Term rewriting systems
- Special equality provers
Automated Theorem Provers

Otter, 1984 [Kal01]
- Successful resolution prover (with equality handling)
- L. Wos, W. McCune: Argonne National Laboratory, Chicago

"Currently, the main application of Otter is research in abstract algebra and formal logic. Otter and its predecessors have been used to answer many open questions in the areas of finite semigroups, ternary Boolean algebra, logic calculi, combinatory logic, group theory, lattice theory, and algebraic geometry."
SETHEO, 1987 [Let+92]
- PROLOG technology (Warren Abstract Machine)
- W. Bibel, J. Schumann, R. Letz: Munich Technical University
- PARTHEO: implementation on parallel computers
E, 2000 [Sch02]
- Modern equality prover
- S. Schulz: Munich Technical University

From the homepage of E: "E is a purely equational theorem prover for clausal logic. That means it is a program that you can stuff a mathematical specification (in clausal logic with equality) and a hypothesis into, and which will then run forever, using up all of your machine's resources. Very occasionally it will find a proof for the hypothesis and tell you so ;-)."
Vampire
- Resolution with equality handling
- A. Voronkov: University of Manchester, England
- Winner of CADE-20, 2005

Isabelle [NPW02]
- Interactive prover for higher-order predicate logic
- T. Nipkow, L. Paulson, M. Wenzel: University of Cambridge, Munich Technical University
Mathematical Examples

Application of E [Sch02] to: left- and right-neutral elements in a semigroup are equal.

Definition: A structure (M, ·) consisting of a set M with a two-place inner operation · is called a semigroup if the law of associativity

∀x ∀y ∀z  (x · y) · z = x · (y · z)

holds. An element e ∈ M is called left-neutral (right-neutral) if ∀x e · x = x (∀x x · e = x).
It has to be shown:

Theorem: If a semigroup has a left-neutral element el and a right-neutral element er, then el = er.
Proof, variant 1: intuitive mathematical reasoning.

Clearly it holds for all x ∈ M that

el · x = x     (9)

and

x · er = x.     (10)

If we set x = er in Equation (9) and x = el in Equation (10), we obtain the two equations el · er = er and el · er = el. Thus

el = el · er = er.
Proof, variant 2: a manual resolution proof.

(¬ el = er)1                         negated query
(m(m(x, y), z) = m(x, m(y, z)))2     associativity
(m(el, x) = x)3                      left-neutral element
(m(x, er) = x)4                      right-neutral element

Equality axioms:

(x = x)5                             (reflexivity)
(¬ x = y ∨ y = x)6                   (symmetry)
(¬ x = y ∨ ¬ y = z ∨ x = z)7         (transitivity)
(¬ x = y ∨ m(x, z) = m(y, z))8       (substitution in m)
(¬ x = y ∨ m(z, x) = m(z, y))9       (substitution in m)

Proof:

Res(3, 6, x6/m(el, x3), y6/x3):      (x = m(el, x))10
Res(7, 10, x7/x10, y7/m(el, x10)):   (¬ m(el, x) = z ∨ x = z)11
Res(4, 11, x4/el, x11/er, z11/el):   (er = el)12
Res(1, 12, ∅):                       ()
Proof, variant 3: automated resolution proof with the prover E.

Transformation into the clause normal form language LOP: a clause

(¬A1 ∨ … ∨ ¬Am ∨ B1 ∨ … ∨ Bn)

is written as B1;…;Bn <- A1,…,Am.
(Tail of a CLP(FD) room-assignment program; the beginning of the listing is missing.)

    ... >= 2,                    % distance Mueller/Schmid >= 2
    Huber #= Mathe,              % Huber examines mathematics
    Physik #= 4,                 % physics is in room 4
    Deutsch #\= 1,               % German is not in room 1
    Englisch #\= 1,              % English is not in room 1
    nl, write([Maier, Huber, Mueller, Schmid]), nl,
    write([Deutsch, Englisch, Mathe, Physik]), nl.
Output:
[3,1,2,4]
[2,3,1,4]

Room plan:

Room no.   1        2        3         4
Teacher    Hoover   Miller   Mayer     Smith
Subject    Math     German   English   Physics

Finite domain constraint solver. Exercise: the Einstein puzzle.
Summary

- Unification, lists, declarative programming
- Relational view of procedures
- Parameters serve for input and output
- Short programs
- A tool for rapid prototyping
- CLP for optimization and planning tasks and for logic puzzles
- PROLOG in Europe, LISP in the USA

Literature: [Bra86] and [CM94]; handbooks: [Wie04; Dia04]; CLP: [Bar98]
Part VI

Search, Games and Problem Solving: Introduction · Uninformed Search · Heuristic Search · Games with Opponents · State of the Art
[Figure: a heavily trimmed search tree. Or: where is my cat?]

A Search Tree

[Figure: search tree for a simple SLD resolution proof (depth bound 14).]
Example: Chess

- Branching factor b = 30, depth d = 50: 30^50 ≈ 7.2 · 10^73 leaves.
- Number of inference steps: Σ_{d=0}^{50} 30^d = (1 − 30^51)/(1 − 30) ≈ 7.4 · 10^73.
- 10000 computers, each CPU performing one billion inferences per second, parallelization without loss.
- Computation time:

  7.4 · 10^73 inferences / (10000 · 10^9 inferences/sec) = 7.4 · 10^60 sec ≈ 2.3 · 10^53 years

- ≈ 10^43 times the age of the universe.

Questions:
- Why do good chess players exist, and nowadays also good chess computers?
- Why do mathematicians find proofs for propositions in which the search space is even larger?
The 8-Puzzle

[Figure: a start state of the 8-puzzle, the goal state (1 2 3 / 4 5 6 / 7 8 _), and the search tree of the first moves.]

Average branching factor = √8 ≈ 2.83
Definition: The average branching factor of a tree is defined as the branching factor that a tree with constant branching factor, equal depth, and an equal number of leaf nodes would have (see Exercise ??).
[Figure: the same search tree, where moves that merely undo the preceding move are omitted.]

Average branching factor ≈ 1.89 (9)

9 For an 8-puzzle the average branching factor depends on the start state.
Definition: A search problem is defined by the following values:

State: description of the state of the world in which the agent finds itself.
Starting state: the initial state in which the search agent is started.
Goal state: if the agent reaches a goal state, it terminates and outputs a solution (if desired).
Actions: all of the agent's allowed actions.
Solution: the path in the search tree from the starting state to the goal.
Cost function: assigns a cost value to every action; necessary for finding a cost-optimal solution.
State space: the set of all states.
Search tree: the states are the nodes and the actions are the edges.
Applied to the 8-puzzle, we get:

State: a 3 × 3 matrix S with the values 1, 2, 3, 4, 5, 6, 7, 8 (each occurring once) and one empty square.
Starting state: an arbitrary state.
Goal state: an arbitrary state, e.g. the goal state shown in Figure ??.
Actions: movements of the empty square Sij to the left (if j ≠ 1), right (if j ≠ 3), up (if i ≠ 1), down (if i ≠ 3).
Cost function: the constant function 1, since all actions have equal cost.
State space: the state space decomposes into domains that are mutually unreachable (Exercise ??). Thus there are unsolvable 8-puzzle problems.
For the analysis of search algorithms, the following terms are needed:

Definition:
- The number of successor states of a state s is called the branching factor b(s), or b if the branching factor is constant.
- The effective branching factor of a tree of depth d with n total nodes is defined as the branching factor that a tree with constant branching factor, equal depth, and equal n would have (see Exercise ??).
- A search algorithm is called complete if it finds a solution for every solvable problem. If a complete search algorithm terminates without finding a solution, then the problem is unsolvable.

Solving the equation

n = Σ_{i=0}^{d} b^i = (b^{d+1} − 1)/(b − 1)

for b yields the effective branching factor.
Theorem: For heavily branching finite search trees with a large constant branching factor, almost all nodes are on the last level.

Proof: Exercise ??.
Example: Shortest path from a city A to a city B.

[Figure: the graph of southern Germany with cost function. Nodes: Frankfurt, Würzburg, Bayreuth, Mannheim, Karlsruhe, Stuttgart, Ulm, Nürnberg, München, Passau, Rosenheim, Salzburg, Linz, Memmingen, Basel, Bern, Zürich, Landeck, Innsbruck; the edge labels give the distances in km.]
State: a city as the current location of the traveler.
Starting state: an arbitrary city.
Goal state: an arbitrary city.
Actions: travel from the current city to a neighboring city.
Cost function: the distance between the cities.
State space: all cities, i.e. the nodes of the graph.
Definition: A search algorithm is called optimal if, whenever a solution exists, it finds the solution with the lowest cost.

The 8-puzzle problem is
- deterministic: every action leads from a state to a unique successor state;
- observable: the agent always knows which state it is in.

Under these conditions, so-called offline algorithms can find optimal solutions. Otherwise: reinforcement learning.
Uninformed Search

Breadth-first search

[Figure: a breadth-first search tree; the nodes are numbered 1-30 in the order in which they are expanded, level by level.]

Breadth-first-search(NodeList, Goal)
  NewNodes = ∅
  For all Node ∈ NodeList
    If GoalReached(Node, Goal) Return("Solution found", Node)
    NewNodes = Append(NewNodes, Successors(Node))
  If NewNodes ≠ ∅
    Return(Breadth-first-search(NewNodes, Goal))
  Else Return("No solution")

- The algorithm is generic.
- Application-specific functions: GoalReached and Successors.
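A direct Python rendering of the pseudocode above (a sketch; GoalReached and Successors remain the application-specific parts, and, as in the pseudocode, no duplicate elimination is done):

def breadth_first_search(node_list, goal, successors, goal_reached):
    new_nodes = []
    for node in node_list:
        if goal_reached(node, goal):
            return node                     # solution found
        new_nodes.extend(successors(node))
    if new_nodes:
        return breadth_first_search(new_nodes, goal, successors, goal_reached)
    return None                             # no solution

# Tiny example: search the integers, doubling or adding 1, from 1 to 10.
print(breadth_first_search([1], 10,
                           successors=lambda n: [2 * n, n + 1],
                           goal_reached=lambda n, g: n == g))  # 10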
Analysis

- Complete.
- Optimal if the costs of all actions are the same (see Exercise).
- Computation time:

  c · Σ_{i=0}^{d} b^i = c · (b^{d+1} − 1)/(b − 1) = O(b^d)

- Memory space = O(b^d).
- Uniform cost search: the node with the lowest cost from the ascendingly sorted node list is always expanded. Always optimal!
Depth-first search

[Figure: a depth-first search tree; the nodes are numbered in the order in which they are visited.]

Depth-first-search(Node, Goal)
  If GoalReached(Node, Goal) Return("Solution found")
  NewNodes = Successors(Node)
  While NewNodes ≠ ∅
    Result = Depth-first-search(First(NewNodes), Goal)
    If Result = "Solution found" Return("Solution found")
    NewNodes = Rest(NewNodes)
  Return("No solution")

The algorithm for depth-first search.
Analysis

- Not complete.
- Not optimal.
- Computation time = O(b^d)
- Memory requirement = O(b · d)
Iterative Deepening

[Figure: iterative deepening performs repeated depth-first searches with depth limits 1, 2, 3, …]

IterativeDeepening(Node, Goal)
  DepthLimit = 0
  Repeat
    Result = DepthFirstSearch-B(Node, Goal, 0, DepthLimit)
    DepthLimit = DepthLimit + 1
  Until Result = "Solution found"

Depth-first search with bound:

DepthFirstSearch-B(Node, Goal, Depth, Limit)
  If GoalReached(Node, Goal) Return("Solution found")
  NewNodes = Successors(Node)
  While NewNodes ≠ ∅ And Depth < Limit
    Result = DepthFirstSearch-B(First(NewNodes), Goal, Depth + 1, Limit)
    If Result = "Solution found" Return("Solution found")
    NewNodes = Rest(NewNodes)
  Return("No solution found")

Analysis

- Complete.
- Optimal if the costs are constant and the depth increment is 1.
- Computation time = O(b^d)
- Memory requirement = O(b · d)
Loss from repeated computations:

N_b(d_max) = Σ_{i=0}^{d_max} b^i = (b^{d_max+1} − 1)/(b − 1)

Σ_{d=1}^{d_max−1} N_b(d) = Σ_{d=1}^{d_max−1} (b^{d+1} − 1)/(b − 1)
                        = (1/(b−1)) · (Σ_{d=2}^{d_max} b^d − d_max + 1)
                        ≈ (1/(b−1)) · (b^{d_max+1} − 1)/(b − 1)
                        = N_b(d_max)/(b − 1)

For b = 20 the first d_max − 1 trees together contain only about 1/(b − 1) = 1/19 of the number of nodes in the last tree.
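The ratio can be checked numerically in a few lines of Python (assuming, as above, a constant branching factor b):

def N(b, d):                       # number of nodes of a tree of depth d
    return (b ** (d + 1) - 1) // (b - 1)

b, d_max = 20, 6
overhead = sum(N(b, d) for d in range(1, d_max)) / N(b, d_max)
print(overhead, 1 / (b - 1))       # both roughly 0.0526 = 1/19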
Comparison

                    Breadth-first  Uniform cost  Depth-first  Iterative deepening
Completeness        yes            yes           no           yes
Optimal solution    yes (*)        yes           no           yes (*)
Computation time    b^d            b^d           ∞ or b^ds    b^d
Memory use          b^d            b^d           b·d          b·d

(*): only true with constant action cost. ds is the maximal depth of a finite search tree.
Heuristic Search

- Heuristics are problem-solving strategies which in many cases find a solution faster than uninformed search.
- There is no guarantee!
- In everyday life, heuristic methods are important.
- Real-time decisions under limited resources: a good solution found quickly is preferred over a solution that is optimal but very expensive to derive.
Mathematical modeling

- Heuristic evaluation function f(s) for states
- Node = state + heuristic evaluation + …
The algorithm:

HeuristicSearch(Start, Goal)
  NodeList = [Start]
  While True
    If NodeList = ∅ Return("No solution")
    Node = First(NodeList)
    NodeList = Rest(NodeList)
    If GoalReached(Node, Goal) Return("Solution found", Node)
    NodeList = SortIn(Successors(Node), NodeList) (10)

Depth-first and breadth-first search are special cases of the function HeuristicSearch!

10 When sorting a new node into the node list, it may be advantageous to check whether the node is already present and, if so, to delete the duplicate.
Remarks

- The ideal heuristic would be a function that calculates the actual costs from each node to the goal.
- In the real world we use a cost estimate function h.

[Cartoon] He: "Dear, think of the fuel costs! I'll pluck one for you somewhere else." She: "No, I want that one over there!"
Greedy Search

[Figure: the graph of southern Germany with cost function, as above.]

Cost estimate function h(s) = flying distance from city s to Ulm:

Basel      204    Landeck     143    Passau      257
Bayreuth   207    Linz        318    Rosenheim   168
Bern       247    München     120    Stuttgart    75
Frankfurt  215    Mannheim    164    Salzburg    236
Innsbruck  163    Memmingen    47    Würzburg    153
Karlsruhe  137    Nürnberg    132    Zürich      157
[Figure: greedy search trees for the path from Linz to Ulm and from Mannheim to Ulm; each node is labeled with its heuristic value h.]

- Mannheim-Nürnberg-Ulm: 401 km
- Mannheim-Karlsruhe-Stuttgart-Ulm: 323 km
A*-Search

Cost function: g(s) = sum of accrued costs from the root to the current node.
Heuristic cost estimate: h(s) = estimated cost from the current node to the goal.
Heuristic evaluation function: f(s) = g(s) + h(s).

Requirement:

Definition: A heuristic cost estimate function h(s) that never overestimates the actual cost from state s to the goal is called admissible.

A*-algorithm = HeuristicSearch with evaluation function f(s) = g(s) + h(s) and admissible heuristic h.
[Figure: two snapshots of the A* search tree for the path from Frankfurt to Ulm. In the boxes below the city names we show g(s), h(s), f(s); numbers in parentheses after the city names show the order in which the nodes were generated. The goal Ulm is reached with f = 323 via Frankfurt, Mannheim, Karlsruhe, Stuttgart.]
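The A* algorithm itself fits in a few lines of Python. The sketch below uses a heap as the sorted node list; the mini-graph and heuristic values are a small assumed fragment in the spirit of the Frankfurt-Ulm example, not the complete map.

import heapq

def a_star(start, goal, neighbors, h):
    # frontier entries: (f, g, state, path); the heap keeps them sorted by f
    frontier = [(h(start), 0, start, [start])]
    best_g = {start: 0}
    while frontier:
        f, g, state, path = heapq.heappop(frontier)
        if state == goal:
            return g, path
        for succ, cost in neighbors(state):
            g2 = g + cost
            if g2 < best_g.get(succ, float("inf")):
                best_g[succ] = g2
                heapq.heappush(frontier, (g2 + h(succ), g2, succ, path + [succ]))
    return None

# Assumed mini-graph (a fragment of the map example):
graph = {"Frankfurt": [("Mannheim", 85), ("Wuerzburg", 111)],
         "Mannheim": [("Karlsruhe", 67), ("Nuernberg", 230)],
         "Karlsruhe": [("Stuttgart", 64)],
         "Stuttgart": [("Ulm", 107)],
         "Nuernberg": [("Ulm", 171)]}
h = {"Frankfurt": 215, "Mannheim": 164, "Karlsruhe": 137, "Stuttgart": 75,
     "Nuernberg": 132, "Wuerzburg": 153, "Ulm": 0}
print(a_star("Frankfurt", "Ulm",
             lambda s: graph.get(s, []), lambda s: h[s]))
# (323, ['Frankfurt', 'Mannheim', 'Karlsruhe', 'Stuttgart', 'Ulm'])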
Theorem: The A* algorithm is optimal. That is, it always finds the solution with the lowest total cost if the heuristic h is admissible.

Proof: [Figure: the first solution node l found, a frontier node s, and an alternative solution node l′ below s, with f(l) ≤ f(s) and f(s) ≤ g(l′).]

The first solution node l found by A* never has a higher cost than another arbitrary solution node l′:

g(l) = g(l) + h(l) = f(l) ≤ f(s) = g(s) + h(s) ≤ g(l′).
IDA*-Search

Weaknesses of A*:
- High memory requirements.
- The list of open nodes must be kept sorted ⇒ heapsort.

Solution: iterative deepening
- The same as depth-first search, but with a limit on the heuristic evaluation f(s).
Route Planning with A*-Search (11)

- Simple heuristic: air line distance.
- Better: landmarks, 5 to 60 randomly selected cities.
- Preprocessing: for all landmarks the shortest paths to all nodes are stored.

Let l be a landmark, s the current node, z the goal node, and c*(x, y) the cost of the shortest path from x to y. Triangle inequality:

c*(s, l) ≤ c*(s, z) + c*(z, l).

Solving for c*(s, z) yields

h(s) = c*(s, l) − c*(z, l) ≤ c*(s, z).

h(s) is admissible.

11 A. Batzill. Optimal Route Planning on Mobile Systems. Masterarbeit, Hochschule Ravensburg-Weingarten, 2016.
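A sketch of the landmark heuristic in Python, assuming the shortest-path costs c*(·, l) to each landmark have been precomputed (e.g. with Dijkstra's algorithm) and stored in dictionaries; the numbers below are an assumed toy example, not data from the thesis.

def landmark_h(dist_to_landmark, goal):
    # heuristic h(s) for a fixed goal z, given the table c*(., l)
    def h(s):
        return max(0, dist_to_landmark[s] - dist_to_landmark[goal])
    return h

def multi_landmark_h(tables, goal):
    # with several landmarks, the maximum is still admissible
    hs = [landmark_h(t, goal) for t in tables]
    return lambda s: max(h(s) for h in hs)

dists = {"A": 0, "B": 5, "C": 9}   # assumed c*(., l) for one landmark l
h = landmark_h(dists, goal="B")
print(h("C"))  # 4, a lower bound on the true cost from C to B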
[Figure: A* search trees with no heuristic (red), air line distance (dark green), and the landmark heuristic (blue).]

Results (12)

                      unidirectional             bidirectional
                      tree size   run time       tree size   run time
                      [nodes]     [msec]         [nodes]     [msec]
no heuristic          62000       192            41850       122
air line distance      9380        86            12193        84
landmark heuristic     5260        16             7290        16

Advantages of the landmark heuristic:
- Smaller search space than the air line heuristic.
- Fast computation: only a table lookup.

12 A. Batzill. Optimal Route Planning on Mobile Systems. Masterarbeit, Hochschule Ravensburg-Weingarten, 2016.
Optimization of time, not distance

- d(s, z) is replaced by t(s, z) = d(s, z)/vmax
- the air line heuristic estimate becomes worse
- the landmark heuristic is still very good

Further improvement

- contraction hierarchies (13)

13 R. Geisberger et al. Contraction hierarchies: Faster and simpler hierarchical routing in road networks. In: Experimental Algorithms. Springer, 2008, pp. 319-333.
Comparison of search algorithms

Admissible heuristics for the 8-puzzle:
- h1 counts the number of squares that are not in the right place
- h2 measures the Manhattan distance of the squares to their goal positions

Example:

[Figure: an example 8-puzzle state and the goal state 1 2 3 / 4 5 6 / 7 8 _]

For the distance between the two states we obtain

h1(s) = 7,    h2(s) = 1 + 1 + 1 + 1 + 2 + 0 + 3 + 1 = 10.

h1 and h2 are admissible!
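Both heuristics are easy to implement. The following Python functions are our own sketch; encoding a state as a tuple read row by row, with 0 for the blank, is an assumption, and the example state is arbitrary (the board in the slide's figure is not reproduced here).

def h1(state, goal):
    # Number of tiles that are not on their goal square (blank ignored).
    return sum(1 for s, g in zip(state, goal) if s != 0 and s != g)

def h2(state, goal):
    # Sum of the Manhattan distances of all tiles to their goal squares.
    dist = 0
    for pos, tile in enumerate(state):
        if tile == 0:
            continue
        gpos = goal.index(tile)
        dist += abs(pos // 3 - gpos // 3) + abs(pos % 3 - gpos % 3)
    return dist

goal = (1, 2, 3, 4, 5, 6, 7, 8, 0)    # 0 denotes the blank
state = (8, 1, 3, 4, 0, 2, 7, 6, 5)   # an arbitrary example state
print(h1(state, goal), h2(state, goal))   # -> 5 10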
Comparison of search algorithms

- implemented in Mathematica
- averaged over 132 randomly generated 8-puzzle problems

        Iterative deepening     heuristic h1           heuristic h2          Num.
Depth   Steps     Time [sec]    Steps     Time [sec]   Steps     Time [sec]  runs
  2          20     0.003           3.0     0.0010         3.0     0.0010     10
  4          81     0.013           5.2     0.0015         5.0     0.0022     24
  6         806     0.13           10.2     0.0034         8.3     0.0039     19
  8        6455     1.0            17.3     0.0060        12.2     0.0063     14
 10       50512     7.9            48.1     0.018         22.1     0.011      15
 12      486751    75.7           162.2     0.074         56.0     0.031      12
 14           −      −          10079.2     2.6          855.6     0.25       16
 16           −      −          69386.6    19.0         3806.5     1.3       13
 18           −      −         708780.0   161.6        53941.5    14.1        4

(Depths 2-12: A*-algorithm; depths 14-18: IDA*.)

Effective branching factor:
- uninformed search: 2.8
- heuristic h1: 1.5
- heuristic h2: 1.3
Summary

- For uninformed search, only iterative deepening is practically useful. Why?
- IDA* is complete, fast and memory efficient.
- Good heuristics greatly reduce the effective branching factor.
- If the problem is unsolvable, what good does a heuristic do?

How to find good heuristics?
- manually, e.g. by simplification of the problem
- automatic generation of heuristics by machine-learning techniques
Games with Opponents

- games for two players, e.g. chess, checkers, Othello, Go
- deterministic, observable
- card games: only partially observable. Why?
- zero-sum games: win + loss = 0

Minimax Search

Characteristics of games:
- for chess, the average branching factor is around 30 to 35
- 50 moves per player: 30^100 ≈ 10^148 leaf nodes
- real-time requirements ⇒ limited search depth, heuristic evaluation function B(s)
- player: Max, opponent: Min
- assumption: the opponent Min always makes the best move he can
- Max maximizes the evaluation of his moves, Min minimizes the evaluation of his moves

[Figure: a minimax game tree with look-ahead of 4 half-moves. The leaf evaluations are propagated upward: Min nodes take the minimum, Max nodes the maximum of their children.]
Alpha-Beta Pruning

[Figure: an alpha-beta game tree with look-ahead of 4 half-moves; pruned subtrees are marked with bounds such as ≤ 1 and ≤ 2]

- At every leaf node the evaluation function is calculated.
- For every maximum node the current largest child value is saved in α.
- For every minimum node the current smallest child value is saved in β.
- If at a minimum node k the current value β ≤ α, then the search below k can stop. Here α is the largest value of a maximum node in the path from the root to k.
- If at a maximum node l the current value α ≥ β, then the search below l can stop. Here β is the smallest value of a minimum node in the path from the root to l.
AlphaBetaMax(Node, α, β)
  If DepthLimitReached(Node) Then Return(Rating(Node))
  NewNodes = Successors(Node)
  While NewNodes ≠ ∅
    α = Maximum(α, AlphaBetaMin(First(NewNodes), α, β))
    If α ≥ β Then Return(β)
    NewNodes = Rest(NewNodes)
  Return(α)

AlphaBetaMin(Node, α, β)
  If DepthLimitReached(Node) Then Return(Rating(Node))
  NewNodes = Successors(Node)
  While NewNodes ≠ ∅
    β = Minimum(β, AlphaBetaMax(First(NewNodes), α, β))
    If β ≤ α Then Return(α)
    NewNodes = Rest(NewNodes)
  Return(β)

- The algorithm is an extension of depth-first search with two functions for maximum and minimum nodes which call each other mutually.
- It uses the values defined above for α and β.
Computation time

Complexity heavily depends on the order in which child nodes are traversed:
- Worst case: alpha-beta pruning offers no advantage, i.e. nd = b^d.
- Best case: successors of maximum nodes are sorted in descending order, successors of minimum nodes in ascending order. The effective branching factor is then ≈ √b and nd = (√b)^d = b^(d/2). For chess, the branching factor reduces from 35 to about 6; the search horizon is doubled.
- Average case: branching factor ≈ b^(3/4). For chess, the branching factor reduces from 35 to about 14 ⇒ 8 half-moves of look-ahead instead of 6.
- Heuristic node ordering: branching factor ≈ 7 to 8.

Non-deterministic Games

- e.g. dice games
- move sequence: Max, dice, Min, dice, ...
- average the evaluations over all possible rolls
Heuristic Evaluation Functions

- list of relevant features or attributes
- linear evaluation function B(s):

B(s) = a1 · material + a2 · pawn_structure + a3 · king_safety
     + a4 · knight_in_center + a5 · bishop_diagonal_coverage + ...   (11)

with, for example,

material = material(own_team) − material(opponent)
material(team) = num_pawns(team) · 100 + num_knights(team) · 300
               + num_bishops(team) · 300 + num_rooks(team) · 500
               + num_queens(team) · 900 + ...

The weights ai are set intuitively after discussion with experts; better: optimize the weights by machine-learning methods.
Learning of Heuristics

- the expert is only asked for the relevant features f1(s), ..., fn(s)
- a machine-learning process is used to find an optimal evaluation function B(f1, ..., fn)
- evaluation only at the end of the game, e.g. victory, defeat, or draw
- the learning algorithm updates the evaluation function accordingly

Problem: credit assignment
- no ratings for individual moves
- positive or negative feedback only at the end
- how to give feedback for actions in the past?

Young field: reinforcement learning (see Sec. ).

Most of the world's best chess computers still work without machine-learning techniques. Reasons:
- reinforcement learning requires very long computation times
- manually created heuristics are already heavily optimized
Latest research I

Definition by Elaine Rich: Artificial Intelligence is the study of how to make computers do things at which, at the moment, people are better. Games allow a direct comparison of computer and human.

1950 Claude Shannon, Konrad Zuse and John von Neumann design the first chess programs.
1955 Arthur Samuel (IBM) builds a program that learns to play checkers on an IBM 701, trained on archived games in which every individual move had been rated by experts; later the program plays against itself. Credit assignment: for each individual position during a game it compares the evaluation by the function B(s) with the one calculated by alpha-beta pruning and changes B(s) accordingly.

Latest research II

1961 Samuel's checkers program beats the fourth-best checkers player of the USA.
1990 Tesauro: reinforcement learning; his learning backgammon program TD-Gammon plays at world-champion level (see Sec. ).
1997 IBM's Deep Blue defeats the chess world champion Garry Kasparov with a score of 3.5 games to 2.5. Deep Blue computes on average 12 half-moves ahead with alpha-beta pruning.
2004 Hydra: a chess computer on a parallel machine with 64 Xeon processors (3 GHz and 1 GByte of memory each). Software: Ch. Donninger (Austria) and U. Lorenz, Ch. Lutz (Germany).

Latest research III

Position evaluation on an FPGA co-processor (Field Programmable Gate Arrays), evaluating 200 million positions per second. Hydra computes on average about 18 half-moves ahead and often makes moves which grandmasters cannot comprehend, but which in the end lead to victory. It uses alpha-beta search with relatively general, well-known heuristics and a good hand-coded position evaluation; Hydra works without learning.

2009 Pocket Fritz 4, running on a PDA, wins the Copa Mercosur chess tournament in Buenos Aires with 9 wins and 1 draw against 10 excellent human chess players, three of them grandmasters. Its search engine HIARCS 13 (14) searches fewer than 20,000 positions per second.

Latest research IV

Pocket Fritz 4 is about 10,000 times slower than Hydra.

14 HIARCS = Higher Intelligence Auto Response Chess System.
Today's challenge: Go

- square board with 361 squares, 181 white and 180 black stones
- average branching factor: about 300
- after 4 half-moves: 8 · 10^9 positions
- classical game tree search processes have no chance!
- pattern recognition on the board is needed
- humans are still miles ahead of computer programs
Teil VII

Reasoning with Uncertainty
- Reasoning with Uncertainty
- The Maximum Entropy Method
- LEXMED, a medical expert system for the diagnosis of appendicitis
- Reasoning with Bayesian Networks
- Summary
Flying Penguin

1. Tweety is a penguin
2. Penguins are birds
3. All birds can fly

Formalized in PL1, the knowledge base WB consists of:

penguin(tweety)
penguin(x) ⇒ bird(x)
bird(x) ⇒ fly(x)

It can be derived: fly(tweety) (15)

15 See Sec. 

The flying penguin

New trial: penguins cannot fly:

penguin(x) ⇒ ¬fly(x)

It can be derived: ¬fly(tweety). But it can also still be derived: fly(tweety).

- The knowledge base is inconsistent.
- The logic is monotonic: new knowledge cannot invalidate old knowledge.
Probabilistic logic

- Uncertainty: 99% of all birds can fly.
- Incompleteness: the agent has incomplete information about the state of the world (real-time decisions).

Heuristic search ⇒ reasoning with uncertain or incomplete knowledge: "Let's just sit back and think about what to do!"

Example: If a patient experiences pain in the right lower abdomen and a raised white blood cell (leukocyte) count, this raises the suspicion that it might be appendicitis. Formalized:

Stomach pain right lower ∧ Leukocytes > 10000 → Appendicitis

From "Stomach pain right lower ∧ Leukocytes > 10000" we can then use modus ponens to derive Appendicitis.
MYCIN

- 1976, Shortliffe and Buchanan
- certainty factors represent the certainty of facts and rules
- a rule A →β B is read via the conditional probability β

Example:

Stomach pain right lower ∧ Leukocytes > 10000 →0.6 Appendicitis

- formulas for combining the factors of rules
- the calculus is incorrect
- inconsistent results could be derived
Other formalisms for modeling uncertainty

- non-monotonic logics
- default logic
- Dempster-Shafer theory: assigns a belief function Bel(A) to a logical term A
- fuzzy logic (⇒ control theory)

Reasoning with conditional probabilities

- conditional probabilities instead of (material) implication
- subjective probabilities
- probability theory is well-founded
- reasoning with uncertain and incomplete knowledge
- maximum entropy method (MaxEnt)
- Bayesian networks
Calculating with Probabilities

Example:
- In dice games, the probability of throwing a six is 1/6.
- The probability of throwing an odd number is 1/2.

Definition: Let Ω be the set of possible outcomes of an experiment. Each ω ∈ Ω stands for a possible outcome. If the ωi ∈ Ω exclude each other but cover all possible outcomes, they are called elementary events.

- Throwing a die once: Ω = {1, 2, 3, 4, 5, 6}
- Throwing an even number, {2, 4, 6}, is not an elementary event.
- Throwing a number smaller than 5, {1, 2, 3, 4}, is not an elementary event.
- Reason: {2, 4, 6} ∩ {1, 2, 3, 4} = {2, 4} ≠ ∅.
- With two events A and B, A ∪ B is also an event.
- Ω is the sure event.
- The empty set ∅ is the impossible event.
- Instead of A ∩ B we write A ∧ B, because x ∈ A ∩ B ⇔ x ∈ A ∧ x ∈ B.
Notation:

Set notation   Propositional logic   Description
A ∩ B          A ∧ B                 intersection / and
A ∪ B          A ∨ B                 union / or
Ā              ¬A                    complement / negation
Ω              w                     sure event / true
∅              f                     impossible event / false

- A, B, etc. are random variables.
- We consider only discrete random variables with finite value ranges.
- When throwing dice, the number thrown is discrete with the values 1, 2, 3, 4, 5, 6.
- The probability of throwing a 5 or 6 is 1/3:

P(number ∈ {5, 6}) = P(number = 5 ∨ number = 6) = 1/3.
Definition: Let Ω = {ω1, ω2, ..., ωn} be finite. No elementary event is preferred, that is, we assume a symmetry regarding the frequency of occurrence of all elementary events. The probability P(A) of the event A is then defined by

P(A) = |A| / |Ω| = (number of outcomes favourable to A) / (number of possible outcomes).

Example: When throwing a die, the probability of an even number is

P(number ∈ {2, 4, 6}) = |{2, 4, 6}| / |{1, 2, 3, 4, 5, 6}| = 3/6 = 1/2.

- Every elementary event has the probability 1/|Ω| (Laplace assumption).
- Applicable only to finite event sets.
- Example: eye color with the values green, blue, brown; "eye color = blue" describes the value of a variable.
- Binary (boolean) variables are propositions themselves, e.g. P(JohnCalls) instead of P(JohnCalls = t).
From this definition, some rules follow directly:

Theorem
1. P(Ω) = 1.
2. P(∅) = 0, i.e. the impossible event has probability 0.
3. For mutually exclusive events A and B: P(A ∨ B) = P(A) + P(B).
4. For two complementary events A and ¬A: P(A) + P(¬A) = 1.
5. For arbitrary events A and B: P(A ∨ B) = P(A) + P(B) − P(A ∧ B).
6. For A ⊆ B: P(A) ≤ P(B).
7. If A1, ..., An are the elementary events, then Σ_{i=1}^{n} P(Ai) = 1.

Proof as exercise.
Joint probability distribution for A, B (with P(A, B) = P(A ∧ B)):

P(A, B) = (P(A, B), P(A, ¬B), P(¬A, B), P(¬A, ¬B))

Distribution in matrix form:

P(A, B)   B = w       B = f
A = w     P(A, B)     P(A, ¬B)
A = f     P(¬A, B)    P(¬A, ¬B)

Joint probability distribution

- d variables X1, ..., Xd with n values each
- the distribution contains the values P(X1 = x1, ..., Xd = xd)
- x1, ..., xd each may take n different values
- this creates a d-dimensional matrix with n^d elements
- one of the n^d values is redundant
- the distribution is characterized uniquely by n^d − 1 values
Conditional probabilities

Example: In the Doggenriedstraße in Weingarten, the velocities of 100 vehicles are measured.

Event                                               Frequency   Relative freq.
Vehicle observed                                      100          1
Driver is a student (S)                                30          0.3
Velocity too high (V)                                  10          0.1
Driver is a student and velocity too high (S ∩ V)       5          0.05

Do students speed more frequently than the average person, or than non-students? Answer: the conditional probability

P(V|S) = |driver is a student and velocity too high| / |driver is a student| = 5/30 = 1/6 ≈ 0.17.
Definition: For two events A and B, the probability of A under the condition B (conditional probability) is defined by

P(A|B) = P(A ∧ B) / P(B).

P(A|B) is the probability of A when we regard the event B only, i.e.

P(A|B) = |A ∧ B| / |B|.

Proof:

P(A|B) = P(A ∧ B) / P(B) = (|A ∧ B| / |Ω|) / (|B| / |Ω|) = |A ∧ B| / |B|.
Definition: If for two events A and B

P(A|B) = P(A),

then these events are called independent.

Theorem: For independent events A and B, it follows from the definition that

P(A ∧ B) = P(A) · P(B).

Proof?

Example: The probability of two sixes is 1/36 if the dice are independent, because

P(W1 = 6 ∧ W2 = 6) = P(W1 = 6) · P(W2 = 6) = 1/6 · 1/6 = 1/36.

If die 2 always falls the same as die 1, then

P(W1 = 6 ∧ W2 = 6) = 1/6.
Chain rule

Product rule: P(A ∧ B) = P(A|B) P(B)

Chain rule:

P(X1, ..., Xn) = P(Xn|X1, ..., Xn−1) · P(X1, ..., Xn−1)
               = P(Xn|X1, ..., Xn−1) · P(Xn−1|X1, ..., Xn−2) · P(X1, ..., Xn−2)
               = P(Xn|X1, ..., Xn−1) · P(Xn−1|X1, ..., Xn−2) · ... · P(X2|X1) · P(X1)
               = ∏_{i=1}^{n} P(Xi|X1, ..., Xi−1)    (12)
Marginalization

For binary variables A and B:

P(A) = P((A ∧ B) ∨ (A ∧ ¬B)) = P(A ∧ B) + P(A ∧ ¬B).

In general:

P(X1 = x1, ..., Xd−1 = xd−1) = Σ_{xd} P(X1 = x1, ..., Xd−1 = xd−1, Xd = xd)

- The resulting distribution P(X1, ..., Xd−1) is called the marginal distribution.
- It corresponds to the projection of a cube onto one edge (margin).
Example:
- Leuko: leukocyte value higher than 10000
- App: patient has appendicitis (appendix inflammation)

P(App, Leuko)   App     ¬App    total
Leuko           0.23    0.31    0.54
¬Leuko          0.05    0.41    0.46
total           0.28    0.72    1

For example, it holds:

P(Leuko) = P(App, Leuko) + P(¬App, Leuko) = 0.54

P(Leuko|App) = P(Leuko, App) / P(App) = 0.23/0.28 = 0.82
Bayes' Theorem

P(A|B) = P(A ∧ B) / P(B)   as well as   P(B|A) = P(A ∧ B) / P(A)

Bayes' theorem:

P(A|B) = P(B|A) · P(A) / P(B)    (13)

Appendicitis example:
- P(App|Leuko) would be much more interesting for the diagnosis of appendicitis, but it is not published!
- Why is P(Leuko|App) published, but P(App|Leuko) not?

P(App|Leuko) = P(Leuko|App) · P(App) / P(Leuko) = 0.82 · 0.28 / 0.54 = 0.43    (14)
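The appendicitis numbers can be checked directly with a few lines of Python (our own sketch of the computation, not LEXMED code):

# Joint distribution P(App, Leuko) from the table above
P = {(True, True): 0.23,  (True, False): 0.05,
     (False, True): 0.31, (False, False): 0.41}

p_app   = P[(True, True)] + P[(True, False)]    # marginalization: 0.28
p_leuko = P[(True, True)] + P[(False, True)]    # marginalization: 0.54

p_leuko_given_app = P[(True, True)] / p_app     # 0.23/0.28 = 0.82
# Bayes' theorem:
p_app_given_leuko = p_leuko_given_app * p_app / p_leuko
print(round(p_leuko_given_app, 2), round(p_app_given_leuko, 2))   # 0.82 0.43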
Bayes' Theorem: Example

- very reliable burglar alarm
- reports any burglary with 99% certainty
- thus with high certainty: if alarm, then burglary!
- No! With P(A|E) = 0.99, P(A) = 0.1, P(E) = 0.001:

P(E|A) = P(A|E) P(E) / P(A) = 0.99 · 0.001 / 0.1 = 0.01
The Maximum Entropy Method

- a calculus for reasoning under uncertainty
- often there is too little knowledge to solve the necessary equations
- idea of E.T. Jaynes (physicist): maximize the entropy of the sought probability distribution! [Jay57, Jay03]
- [Che83, Nil86, Kan89, KK92]
- application to the LEXMED project

An inference rule for probabilities

Modus ponens:

A, A → B
--------
    B

Generalization to probability rules:

P(A) = α, P(B|A) = β
--------------------
      P(B) = ?

Given: the two probability values α, β. Sought: P(B).

Marginalization:

P(B) = P(A, B) + P(¬A, B) = P(B|A) · P(A) + P(B|¬A) · P(¬A).

With classical probability theory alone, only P(B) ≥ P(B|A) · P(A) can be concluded.
Distribution: P(A, B) = (P(A, B), P(A, ¬B), P(¬A, B), P(¬A, ¬B))

Abbreviations:

p1 = P(A, B),  p2 = P(A, ¬B),  p3 = P(¬A, B),  p4 = P(¬A, ¬B)

- These four parameters (unknowns) define the distribution.
- From them, any probability for A and B can be calculated.
- Four equations are required.

Normalization condition: p1 + p2 + p3 + p4 = 1

From P(A, B) = P(B|A) · P(A) = αβ and P(A) = P(A, B) + P(A, ¬B) we obtain the system of equations:

p1 = αβ                      (15)
p1 + p2 = α                  (16)
p1 + p2 + p3 + p4 = 1        (17)

(15) in (16):  p2 = α − αβ = α(1 − β)    (18)
(16) in (17):  p3 + p4 = 1 − α           (19)

One equation is missing!
Solving an optimization problem

Sought: the distribution p = (p3, p4) which maximizes the entropy

H(p) = − Σ_{i} pi ln pi = −p3 ln p3 − p4 ln p4

under the constraint p3 + p4 = 1 − α (Equation (19)).

Why should the entropy function be maximized?
- The entropy measures the uncertainty of a distribution.
- The negative entropy is a measure of the information content of the distribution.
- Maximizing the entropy minimizes the information content of the distribution.
[Figure: the two-dimensional entropy function with the constraint p3 + p4 = 1]
Maximizing the entropy

- constraint: p3 + p4 − 1 + α = 0
- method of Lagrange multipliers [BHW89]

Lagrange function:

L = −p3 ln p3 − p4 ln p4 + λ(p3 + p4 − 1 + α)

∂L/∂p3 = −ln p3 − 1 + λ = 0
∂L/∂p4 = −ln p4 − 1 + λ = 0

⇒ p3 = p4 = (1 − α)/2.

Hence

P(B) = P(A, B) + P(¬A, B) = p1 + p3 = αβ + (1 − α)/2 = α(β − 1/2) + 1/2.

Substituting α and β:

P(B) = P(A) · (P(B|A) − 1/2) + 1/2.
[Figure: P(B) = P(A)(P(B|A) − 1/2) + 1/2, plotted over P(A) for P(B|A) = 0, 0.1, ..., 1]
Theorem: Let there be a consistent (16) set of linear probabilistic equations. Then there exists a unique maximum of the entropy function with the given equations as constraints. The MaxEnt distribution thereby defined has minimum information content under the constraints.

- There is no other distribution which satisfies the constraints while having a lower information content (i.e. higher entropy).
- A calculus which leads to distributions with lower entropy adds information ad hoc, which is not justified.

16 A set of probabilistic equations is called consistent if there is at least one solution, that is, one distribution which satisfies all equations.
- p3 and p4 always occur symmetrically
- therefore p3 = p4 (indifference)
- in general:

Definition: If an arbitrary exchange of two or more variables in the Lagrange equations results in equivalent equations, these variables are called indifferent.

Theorem: If a set of variables {pi1, ..., pik} is indifferent, then the maximum of the entropy under the given constraints is at the point where pi1 = pi2 = ... = pik.
Maximum Entropy Without Explicit Constraints

- no knowledge given
- no constraints besides the normalization condition p1 + p2 + ... + pn = 1
- all variables are indifferent
- hence p1 = p2 = ... = pn = 1/n (Exercise)
- all worlds are equally probable
Special case: two variables A and B

P(A, B) = P(A, ¬B) = P(¬A, B) = P(¬A, ¬B) = 1/4,

hence P(A) = P(B) = 1/2 and P(B|A) = 1/2.

[Figure: the entropy function for this case]
A further example

Given: P(B|A) = β. From P(A, B) = P(B|A) P(A) = β P(A) we get p1 = β(p1 + p2), i.e. the constraints

(β − 1) p1 + β p2 = 0
p1 + p2 + p3 + p4 − 1 = 0.

- No symbolic solution!
- The Lagrange equations are solved numerically.

[Figure: p1, p2, p3, p4 depending on β = P(B|A)]
Conditional probability versus material implication

A   B   A ⇒ B   P(A)   P(B)   P(B|A)
w   w     w       1      1       1
w   f     f       1      0       0
f   w     w       0      1     undefined
f   f     w       0      0     undefined

Question: what value does P(B|A) take if only P(A) = α and P(B) = γ are given?
Conditional probability versus material implication

With p1 = P(A, B), p2 = P(A, ¬B), p3 = P(¬A, B), p4 = P(¬A, ¬B), the constraints are

p1 + p2 = α              (20)
p1 + p3 = γ              (21)
p1 + p2 + p3 + p4 = 1    (22)

Maximizing the entropy (see Exercise) yields

p1 = αγ,  p2 = α(1 − γ),  p3 = γ(1 − α),  p4 = (1 − α)(1 − γ).

- From p1 = αγ it follows that P(A, B) = P(A) · P(B): independence of A and B.
- From the definition P(B|A) = P(A, B)/P(A) it follows for P(A) ≠ 0 that P(B|A) = P(B) = γ.
- For P(A) = 0, P(B|A) stays undefined.

A   B   A ⇒ B   P(A)   P(B)   P(B|A)
w   w     w       α      γ       γ
f   w     w       0      γ     undefined
MaxEnt systems

- Often the MaxEnt optimization has no symbolic solution.
- Therefore: numerical entropy maximization.
- SPIRIT: FernUniversität Hagen (distance teaching university) [RM96, BK00]
- PIT: TU Munich [Sch96, ES99, SE00]
- PIT uses Sequential Quadratic Programming (SQP) to find an extremum of the entropy function numerically.
- As input, PIT expects a file with the constraints:

var A{t,f}, B{t,f};
P([A=t]) = 0.6;
P([B=t] | [A=t]) = 0.3;
QP([B=t]);
QP([B=t] | [A=t]);

- QP([B=t]) is a query.
- Web front end on www.pit-systems.de
- Result:

Nr  Truthvalue   Probability  Query
1   UNSPECIFIED  3.800e-01    QP([B=t]);
2   UNSPECIFIED  3.000e-01    QP([A=t]-|> [B=t]);

- i.e. P(B) = 0.38 and P(B|A) = 0.3.
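The same MaxEnt optimization can be reproduced numerically, for instance with SciPy's SLSQP optimizer instead of PIT. A minimal sketch under the constraints P(A) = 0.6 and P(B|A) = 0.3 (variable names are our own):

import numpy as np
from scipy.optimize import minimize

# Unknowns: p = (P(A,B), P(A,~B), P(~A,B), P(~A,~B))
def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)               # avoid log(0)
    return float(np.sum(p * np.log(p)))      # minimizing -H(p) maximizes H(p)

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},       # normalization
    {"type": "eq", "fun": lambda p: p[0] + p[1] - 0.6},   # P(A) = 0.6
    {"type": "eq", "fun": lambda p: p[0] - 0.3 * 0.6},    # P(A,B) = P(B|A) P(A)
]
res = minimize(neg_entropy, x0=np.full(4, 0.25),
               bounds=[(0.0, 1.0)] * 4, constraints=constraints)
p1, p2, p3, p4 = res.x
print(round(p1 + p3, 3))   # P(B) = p1 + p3 -> 0.38, as computed by PIT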
The Tweety example

P(bird|penguin) = 1             penguins are birds
P(flies|bird) ∈ [0.95, 1]       (almost all) birds can fly
P(flies|penguin) = 0            penguins cannot fly

PIT input file:

var penguin{yes,no}, bird{yes,no}, flies{yes,no};
P([bird=yes] | [penguin=yes]) = 1;
P([flies=yes] | [bird=yes]) IN [0.95,1];
P([flies=yes] | [penguin=yes]) = 0;
QP([flies=yes] | [penguin=yes]);

Answer:

Nr  Truthvalue   Probability  Query
1   UNSPECIFIED  0.000e+00    QP([penguin=yes]-|> [flies=yes]);
MaxEnt and non-monotonicity

- Probability intervals are often very helpful.
- The second rule in the sense of "normally birds fly": P(flies|bird) ∈ (0.5, 1]
- MaxEnt enables non-monotonic inference.
- MaxEnt is also successful on challenging benchmarks for non-monotonic inference [Sch96].
- Application of MaxEnt within the medical expert system LEXMED.
LEXMED, a medical expert system for the diagnosis of appendicitis

Manfred Schramm, Walter Rampf, Wolfgang Ertel
Ravensburg-Weingarten University of Applied Sciences, Hospital 14 Nothelfer Weingarten, Technical University Munich [SE00, Le99]

LEXMED = medical expert system capable of learning.

The project was funded by the German state of Baden-Württemberg, the AOK Baden-Württemberg, the Ravensburg-Weingarten University of Applied Sciences and the hospital 14 Nothelfer in Weingarten.

[Screenshot: LEXMED query form]

[Screenshot: LEXMED answer]
Result of the PIT diagnosis:

Diagnosis     App. inflamed   App. perforated   Negative   Other
Probability       0.70             0.17           0.06      0.07
Diagnosis of appendicitis with formal methods

- Appendicitis is the most frequent cause of acute stomach pain [Dom91].
- The diagnosis is difficult [Ohm95]: approx. 20% of the surgically removed appendixes show no clinical abnormalities.
- There are also cases in which an inflamed appendix is not recognized.
- In 1972, de Dombal (Great Britain) developed an expert system for the diagnosis of acute stomach pain [Dom72, Ohm94, Ohm95].
Linear scores

With n symptoms S1, ..., Sn, a score can formally be defined as

Diagnosis = Appendicitis if w1 S1 + ... + wn Sn > Θ, else Negative.

- Scores are too weak for modelling complex relations.
- Score systems cannot take contexts into account; e.g. they cannot distinguish between the leukocyte values of elderly and middle-aged people.
- They place high requirements on the databases (representativeness).
Hybrid probabilistic knowledge base

Query to the expert system: what is the probability of an inflamed or perforated appendix if the patient is a 23-year-old man with pain in the right lower abdomen and a leukocyte value of 13000?

P(Bef4 = inflamed ∨ Bef4 = perforated | Sex = male ∧ Age ∈ 21-25 ∧ Leuko ∈ 12k-15k)
Symptoms

Symptom                     Values                                            #    Short
Gender                      male, female                                      2    Sex2
Age                         0-5, 6-10, 11-15, 16-20, 21-25, 26-35,
                            36-45, 46-55, 56-65, 65-                         10    Age10
Pain 1st quadrant           yes, no                                           2    P1Q2
Pain 2nd quadrant           yes, no                                           2    P2Q2
Pain 3rd quadrant           yes, no                                           2    P3Q2
Pain 4th quadrant           yes, no                                           2    P4Q2
Guarding                    local, global, none                               3    Gua3
Rebound tenderness          yes, no                                           2    Reb2
Pain on tapping             yes, no                                           2    Tapp2
Rectal pain                 yes, no                                           2    RecP2
Bowel sounds                weak, normal, increased, none                     4    BowS4
Abnormal ultrasound         yes, no                                           2    Sono2
Abnormal urine sediment     yes, no                                           2    Urin2
Temperature (rectal)        -37.3, 37.4-37.6, 37.7-38.0, 38.1-38.4,
                            38.5-38.9, 39.0-                                  6    TRec6
Leukocytes                  0-6k, 6k-8k, 8k-10k, 10k-12k, 12k-15k,
                            15k-20k, 20k-                                     7    Leuko7
Medical findings            inflamed, perforated, negative, other             4    Bef4
Knowledge bases

[Figure: the LEXMED knowledge sources. A database of 15000 patients from Baden-Württemberg (1995) is turned almost automatically, by rule induction, into a rule base of 500 rules such as P(Leuko > 100 | App = positive) = 0.7; expert knowledge (Dr. Rampf, Dr. Hontschik) is entered manually. MaxEnt completes the rule base automatically to a complete probability distribution, which the physician queries for a diagnosis.]
System architecture

[Figure: LEXMED system architecture. The runtime system connects the user interface (query/symptoms in, diagnosis out), the patient management with its doctor-specific (private) patient database, a cost matrix with weighting, and the PIT runtime system, which answers the probability queries. In the knowledge modelling part, the probability distribution is built by MaxEnt completion of a rule set obtained from experts, literature, and rule induction on the database.]
Probability distribution

Size of the distribution:

2^10 · 10 · 3 · 4 · 6 · 7 · 4 = 20 643 840,

i.e. 20 643 839 independent values.

- Any rule set with fewer than 20 643 839 probability values does not describe the event space completely.
- But a complete distribution is required.
- A human expert cannot deliver 20 643 839 values!
Functionality of LEXMED

Probability statements (17):

P(Leuko > 20000 | Bef4 = inflamed) = 0.09

17 Instead of single numerical values, intervals may also be used (e.g. [0.06, 0.12]).

The dependency graph

[Figure: dependency graph of the LEXMED variables]
Learning the rules by statistical induction

- estimating the rule probabilities
- structure of the dependency graph = structure of the learned rules (as in a Bayesian network)
- a priori rules, e.g. Equation (23)
- rules with a single condition, e.g. Equation (24)
- rules with the diagnosis and two symptoms, e.g. Equation (25)

P(Bef4 = inflamed) = 0.40                                (23)
P(Sono = yes | Bef4 = inflamed) = 0.43                   (24)
P(S4Q = yes | Bef4 = inflamed ∧ S2Q = yes) = 0.61        (25)

Some LEXMED rules with probability intervals (* stands for ∧):

P([Leuco7=0-6k]   | [Diag4=negativ] * [Age10=16-20]) = [0.132,0.156];
P([Leuco7=6-8k]   | [Diag4=negativ] * [Age10=16-20]) = [0.257,0.281];
P([Leuco7=8-10k]  | [Diag4=negativ] * [Age10=16-20]) = [0.250,0.274];
P([Leuco7=10-12k] | [Diag4=negativ] * [Age10=16-20]) = [0.159,0.183];
P([Leuco7=12-15k] | [Diag4=negativ] * [Age10=16-20]) = [0.087,0.112];
P([Leuco7=15-20k] | [Diag4=negativ] * [Age10=16-20]) = [0.032,0.056];
P([Leuco7=20k-]   | [Diag4=negativ] * [Age10=16-20]) = [0.000,0.023];
P([Leuco7=0-6k]   | [Diag4=negativ] * [Age10=21-25]) = [0.132,0.172];
P([Leuco7=6-8k]   | [Diag4=negativ] * [Age10=21-25]) = [0.227,0.266];
P([Leuco7=8-10k]  | [Diag4=negativ] * [Age10=21-25]) = [0.211,0.250];
P([Leuco7=10-12k] | [Diag4=negativ] * [Age10=21-25]) = [0.166,0.205];
P([Leuco7=12-15k] | [Diag4=negativ] * [Age10=21-25]) = [0.081,0.120];
P([Leuco7=15-20k] | [Diag4=negativ] * [Age10=21-25]) = [0.041,0.081];
P([Leuco7=20k-]   | [Diag4=negativ] * [Age10=21-25]) = [0.004,0.043];

Estimating a probability by counting frequencies:

P(Sono = yes | Bef4 = inflamed) = |Bef4 = inflamed ∧ Sono = yes| / |Bef4 = inflamed| = 0.43.
Risk management using the cost matrix

Result of the PIT diagnosis:

Diagnosis     App. inflamed   App. perforated   Negative   Other
Probability       0.24             0.16           0.55      0.05

What to do?
The cost matrix

Probabilities of the various diagnoses and costs of the therapies:

Therapy                inflamed   perforated   negative    other    avg. cost
                         0.25        0.15        0.55       0.05
Operation                   0         500        5800       6000       3565
Emergency operation       500           0        6300       6500       3915
Ambulant observ.        12000      150000           0      16500      26325
Other                    3000        5000        1300          0       2215
Stationary observ.       3500        7000         400        600       2175

- Optimal decisions have (additional) cost 0.
- Goal: the therapy with the smallest average cost of error.
- Probability vector: (0.25, 0.15, 0.55, 0.05).
- Expected cost of wrong decisions, e.g. for the first line (operation): matrix row · vector = 0.25 · 0 + 0.15 · 500 + 0.55 · 5800 + 0.05 · 6000 = 3565.
- A cost-oriented agent chooses the therapy with the minimal expected cost, here stationary observation.
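The expected costs are a single matrix-vector product. A minimal Python sketch of the cost-oriented decision (our own illustration of the calculation above):

import numpy as np

p = np.array([0.25, 0.15, 0.55, 0.05])  # inflamed, perforated, negative, other
therapies = ["Operation", "Emergency operation", "Ambulant observ.",
             "Other", "Stationary observ."]
cost = np.array([[    0,    500, 5800,  6000],
                 [  500,      0, 6300,  6500],
                 [12000, 150000,    0, 16500],
                 [ 3000,   5000, 1300,     0],
                 [ 3500,   7000,  400,   600]])

expected = cost @ p                      # average cost of error per therapy
for t, c in zip(therapies, expected):
    print(f"{t:22s} {c:8.0f}")           # 3565, 3915, 26325, 2215, 2175
print("decision:", therapies[int(np.argmin(expected))])  # stationary observ.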
Cost matrix in the binary case

- diagnoses: Appendicitis and NSAP, with P(Appendicitis) = p1 and P(NSAP) = p2
- therapies: operation, ambulant observation (send the patient home)
- cost matrix (k2: false positive, k1: false negative):

( 0   k2 )
( k1   0 )

( 0   k2 ) (p1)   (k2 p2)
( k1   0 ) (p2) = (k1 p1)

Multiplying the cost vector by 1/k1 leaves the decision unchanged:

(1/k1) (k2 p2, k1 p1) = ((k2/k1) p2, p1),

i.e. the same result is obtained with the matrix

( 0   k )
( 1   0 )    where k = k2/k1.

Risk management:
- the ratio k is the working point of the diagnosis system
- k = 0: the system is configured extremely risky
- k → ∞: all patients are operated on
Performance measures

Sensitivity = P(classified positive | positive) = |positive and classified positive| / |positive|

Specificity = P(classified negative | negative) = |negative and classified negative| / |negative|
Performance / ROC curve

[Figure: ROC curves (sensitivity over 1 − specificity) for LEXMED, the Ohmann score, a score trained with the LEXMED data, RProp trained with the LEXMED data, and random decisions]
Application field and experience

- use in diagnosis
- quality assurance: comparing the diagnosis quality of hospitals with that of expert systems
- in use in the 14 Nothelfer hospital in Weingarten since 1999: www.lexmed.de
- the diagnosis quality is comparable to that of an experienced surgeon
- commercial marketing is very difficult
- wrong time? patients wish for personal care!
- since de Dombal 1972, 39 years have passed; will it take another 39 years?
Reasoning with Bayesian Networks

- d variables X1, ..., Xd with n values each
- the probability distribution has n^d − 1 values
- in practice the distribution contains many redundancies

Independent variables:

P(X1, ..., Xd) = P(X1) · P(X2) · ... · P(Xd)

Conditional probabilities then become trivial (18):

P(A|B) = P(A, B)/P(B) = P(A) P(B)/P(B) = P(A)

- Example from [RN03]

18 Naive Bayes method! (see Sec. )
The alarm example

Example (J. Pearl)

Knowledge base:

P(J|Al) = 0.90     P(M|Al) = 0.70
P(J|¬Al) = 0.05    P(M|¬Al) = 0.01

P(Al|Bur, Ear) = 0.95
P(Al|Bur, ¬Ear) = 0.94
P(Al|¬Bur, Ear) = 0.29
P(Al|¬Bur, ¬Ear) = 0.001

A priori probabilities: P(Bur) = 0.001, P(Ear) = 0.002.

Queries: P(Bur|J ∨ M), P(J|Bur), P(M|Bur)
Graphical representation of the knowledge as a Bayesian network:

[Figure: the alarm network. Burglary (P(Bur) = 0.001) and Earthquake (P(Ear) = 0.002) are parents of Alarm; John and Mary are children of Alarm.]

CPTs:

Bur  Ear | P(Al)        Al | P(J)        Al | P(M)
 w    w  | 0.95          w | 0.90         w | 0.70
 w    f  | 0.94          f | 0.05         f | 0.01
 f    w  | 0.29
 f    f  | 0.001
Conditional independence

Definition: Two variables A and B are called conditionally independent given C if

P(A, B|C) = P(A|C) · P(B|C).

Examples:

P(J, M|Al) = P(J|Al) · P(M|Al)
P(J, Bur|Al) = P(J|Al) · P(Bur|Al)

Theorem: The following equations are pairwise equivalent, i.e. each individual equation describes the conditional independence of the variables A and B given C:

P(A, B|C) = P(A|C) · P(B|C)    (26)
P(A|B, C) = P(A|C)             (27)
P(B|A, C) = P(B|C)             (28)

Proof:

P(A, B, C) = P(A, B|C) P(C) = P(A|C) P(B|C) P(C)
P(A, B, C) = P(A|B, C) P(B|C) P(C).

Thus P(A|B, C) = P(A|C).
Practical application

P(J|Bur) = P(J, Bur)/P(Bur) = [P(J, Bur, Al) + P(J, Bur, ¬Al)]/P(Bur)    (29)

and, since J is conditionally independent of Bur given Al,

P(J, Bur, Al) = P(J|Bur, Al) P(Al|Bur) P(Bur) = P(J|Al) P(Al|Bur) P(Bur).    (30)

Hence

P(J|Bur) = [P(J|Al) P(Al|Bur) P(Bur) + P(J|¬Al) P(¬Al|Bur) P(Bur)]/P(Bur)
         = P(J|Al) P(Al|Bur) + P(J|¬Al) P(¬Al|Bur).    (31)

P(Al|Bur) and P(¬Al|Bur) are still missing!

P(Al|Bur) = P(Al, Bur)/P(Bur) = [P(Al, Bur, Ear) + P(Al, Bur, ¬Ear)]/P(Bur)
          = [P(Al|Bur, Ear) P(Bur) P(Ear) + P(Al|Bur, ¬Ear) P(Bur) P(¬Ear)]/P(Bur)
          = P(Al|Bur, Ear) P(Ear) + P(Al|Bur, ¬Ear) P(¬Ear)
          = 0.95 · 0.002 + 0.94 · 0.998 = 0.94

P(J|Bur) = 0.9 · 0.94 + 0.05 · 0.06 = 0.849.

Analogously: P(M|Bur) = 0.659.

P(J, M|Bur) = P(J, M|Al) P(Al|Bur) + P(J, M|¬Al) P(¬Al|Bur)
            = P(J|Al) P(M|Al) P(Al|Bur) + P(J|¬Al) P(M|¬Al) P(¬Al|Bur)
            = 0.9 · 0.7 · 0.94 + 0.05 · 0.01 · 0.06 = 0.5922

P(J ∨ M|Bur) = P(¬(¬J, ¬M)|Bur) = 1 − P(¬J, ¬M|Bur)
             = 1 − [P(¬J|Al) P(¬M|Al) P(Al|Bur) + P(¬J|¬Al) P(¬M|¬Al) P(¬Al|Bur)]
             = 1 − [0.1 · 0.3 · 0.94 + 0.95 · 0.99 · 0.06] = 1 − 0.085 = 0.915

P(Bur|J) = P(J|Bur) P(Bur)/P(J) = 0.849 · 0.001/0.052 = 0.016

P(Bur|M) = 0.056 (see Exercise), and P(Bur|J, M) = 0.284.

The general pattern behind P(J|Bur) = P(J|Al) P(Al|Bur) + P(J|¬Al) P(¬Al|Bur) is conditioning:

P(A|B) = Σ_c P(A|B, C = c) P(C = c|B).
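All of these values can be verified by brute-force enumeration over the five binary variables, i.e. by summing the chain-rule factorization of the network. The sketch below is our own Python code, not PIT:

P_bur, P_ear = 0.001, 0.002
P_al = {(True, True): 0.95, (True, False): 0.94,
        (False, True): 0.29, (False, False): 0.001}
P_j = {True: 0.90, False: 0.05}          # P(J | Al)
P_m = {True: 0.70, False: 0.01}          # P(M | Al)

def joint(bur, ear, al, j, m):
    # P(Bur,Ear,Al,J,M) = P(Bur) P(Ear) P(Al|Bur,Ear) P(J|Al) P(M|Al)
    p = (P_bur if bur else 1 - P_bur) * (P_ear if ear else 1 - P_ear)
    p *= P_al[(bur, ear)] if al else 1 - P_al[(bur, ear)]
    p *= P_j[al] if j else 1 - P_j[al]
    p *= P_m[al] if m else 1 - P_m[al]
    return p

tf = (True, False)
# P(Bur | J, M): sum out Ear and Al in numerator and denominator
num = sum(joint(True, e, a, True, True) for e in tf for a in tf)
den = sum(joint(b, e, a, True, True) for b in tf for e in tf for a in tf)
print(round(num / den, 4))   # -> 0.2841, as computed by PIT below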
Software for Bayesian networks: PIT

var Alarm{t,f}, Burglary{t,f}, Earthquake{t,f}, John{t,f}, Mary{t,f};
P([Earthquake=t]) = 0.002;
P([Burglary=t]) = 0.001;
P([Alarm=t] | [Burglary=t] AND [Earthquake=t]) = 0.95;
P([Alarm=t] | [Burglary=t] AND [Earthquake=f]) = 0.94;
P([Alarm=t] | [Burglary=f] AND [Earthquake=t]) = 0.29;
P([Alarm=t] | [Burglary=f] AND [Earthquake=f]) = 0.001;
P([John=t] | [Alarm=t]) = 0.90;
P([John=t] | [Alarm=f]) = 0.05;
P([Mary=t] | [Alarm=t]) = 0.70;
P([Mary=t] | [Alarm=f]) = 0.01;
QP([Burglary=t] | [John=t] AND [Mary=t]);

Response: P([Burglary=t] | [John=t] AND [Mary=t]) = 0.2841.
Bayesian networks and MaxEnt

- Given CPTs or equivalent rules as input, the MaxEnt principle implies the same conditional independences, and thus also the same answers, as a Bayesian network [Sch96].
- Thus Bayesian networks are a special case of MaxEnt.

Software for Bayesian networks: JavaBayes

Software for Bayesian networks: Hugin

- The professional tool Hugin is more powerful.
- Continuous variables are possible.
- Hugin can also learn Bayesian networks, i.e. generate the network fully automatically from statistical data.
Development of Bayesian networks

- For the variables v1, ..., vn with |v1|, ..., |vn| different values each, the full distribution has

∏_{i=1}^{n} |vi| − 1

independent entries. Alarm example: 2^5 − 1 = 31 independent entries.
- For a node vi with ki parent nodes ei1, ..., eiki, the associated CPT has

(|vi| − 1) ∏_{j=1}^{ki} |eij|

entries.
- All CPTs of the network together have

Σ_{i=1}^{n} (|vi| − 1) ∏_{j=1}^{ki} |eij|    (32)

entries (19). Alarm example: 2 + 2 + 4 + 1 + 1 = 10.

19 For a node without ancestors we substitute the value 1, because its CPT contains, with the a priori probability, exactly one value.

Special case:
- n variables with an equal number b of values each
- each node has k parent nodes
- all CPTs together have n(b − 1) b^k entries
- the complete distribution contains b^n − 1 entries
- connections are local; the network becomes modularized
LEXMED as a Bayesian network

[Figure: the LEXMED network with the nodes Leuko7, TRek6, Abw3, PathU2, RektS2, Alt10, Bef4, S1Q2, S2Q2, S3Q2, S4Q2, Losl2, Sono2, Ersch2, Darmg4, Sex2]

- size of the full distribution: 20 643 839 values
- size of the Bayesian network: 521 values

In the order (Leuko, TRek, Abw, Alt, Losl, Sono, Ersch, Darmg, Sex, S4Q, S1Q, S2Q, RektS, PathU, S3Q, Bef4) we calculate

Σ_{i=1}^{n} (|vi| − 1) ∏_{j=1}^{ki} |eij|
= 6·6·4 + 5·4 + 2·4 + 9·7·4 + 1·3·4 + 1·4 + 1·2·4 + 3·3·4 + 1·4
  + 1·4·2 + 1·4·2 + 1·4 + 1·4 + 1·4 + 1·4 + 1 = 521.
Causality and network structure

Construction of a Bayesian network:
1. Design the network structure (usually performed manually).
2. Enter the probabilities in the CPTs (usually automated).

- causes: burglary and earthquake
- symptoms: John and Mary
- Alarm: hidden variable
- considering causality: go from cause to effect

[Figure: stepwise construction of the alarm network considering causality. Compare: Exercise]
Semantics of Bayesian networks

[Figure: three small networks over A, B, C. There is no edge between A and B if they are independent (left) or conditionally independent given C (middle, right).]
Requirements:
- the Bayesian network has no cycles
- no variable has a lower index than any of its predecessor nodes (parents are numbered before their children)

Then it holds that

P(Xn|X1, ..., Xn−1) = P(Xn|Parents(Xn)).

Theorem: A node in a Bayesian network is conditionally independent of all non-successor nodes, given its parents.
[Figure: if the parent nodes E1 and E2 are given, then all non-successor nodes B1, ..., B8 are independent of A; N1, ..., N4 are successors of A.]
Chain rule for Bayesian networks:

P(X1, ..., Xn) = ∏_{i=1}^{n} P(Xi|X1, ..., Xi−1) = ∏_{i=1}^{n} P(Xi|Parents(Xi))

Thus Equation (30) holds:

P(J, Bur, Al) = P(J|Al) P(Al|Bur) P(Bur).
The most important concepts and foundations of Bayesian networks are now known, and we can summarize them [Jen01]:

Definition: A Bayesian network is defined by:
- a set of variables (nodes) and a set of directed edges between these variables
- each variable has finitely many possible values
- the variables together with the edges form a directed acyclic graph (DAG); a DAG is a graph without cycles, that is, without paths of the form (A, ..., A)
- for every variable A, the CPT (the table of conditional probabilities P(A|Parents(A))) is given

Two variables A and B are called conditionally independent given C if P(A, B|C) = P(A|C) · P(B|C) or, equivalently, if P(A|B, C) = P(A|C). Besides the basic computation rules for probabilities, the following rules also hold:

Bayes' theorem: P(A|B) = P(B|A) · P(A) / P(B)

Marginalization: P(B) = P(A, B) + P(¬A, B) = P(B|A) · P(A) + P(B|¬A) · P(¬A)

Conditioning: P(A|B) = Σ_c P(A|B, C = c) P(C = c|B)

A variable in a Bayesian network is conditionally independent of all non-successor variables given its parent variables. If X1, ..., Xn−1 are no successors of Xn, we have P(Xn|X1, ..., Xn−1) = P(Xn|Parents(Xn)). This condition must be honored during the construction of a network.

During the construction of a Bayesian network, the variables should be ordered according to causality: first the causes, then the hidden variables, and the diagnosis variables last.

See also: d-separation in [Pea88] and [Jen01].
Summary

- probabilistic logic for reasoning with uncertain knowledge
- the method of maximum entropy models non-monotonic reasoning
- Bayesian networks are a special case of MaxEnt
- Bayesian networks rely on independence assumptions
- in a Bayesian network, all CPTs must be filled completely
- with MaxEnt, arbitrary knowledge can be formulated, e.g. "I am pretty sure that A is true": P(A) ∈ [0.6, 1]
- inconsistency: P(A) = 0.7 and P(A) = 0.8
- PIT recognizes inconsistencies
- in some cases reasoning is possible anyway

Combination of MaxEnt and Bayesian networks:
1. Construct a network using the Bayesian methodology.
2. Missing values in the CPTs can be replaced by intervals or by other probabilistic formulas.
3. The MaxEnt system completes the rule set.
4. Example: LEXMED.

LEXMED:
- medical expert system LEXMED
- better than linear score systems
- scores are equivalent to the special case naive Bayes, that is, to the assumption that all symptoms are conditionally independent given the diagnosis (see Sec. , Exercise)
- LEXMED can learn knowledge from data given in a database (see Ch. )
Outlook

- nowadays, Bayesian inference is very important and well-developed
- handling continuous variables is possible under restrictions
- causality in Bayesian networks
- undirected Bayesian networks
- literature: [Pea88; Jen01; Whi96; DHS01]
- Association for Uncertainty in Artificial Intelligence (AUAI) (www.auai.org), UAI conference

Teil VIII

Machine Learning and Data Mining
- The Perceptron, a Linear Classifier
- The Nearest Neighbour Method
- Decision Tree Learning
- Learning of Bayesian Networks
- The Naive Bayes Classifier
Why machine learning?

Elaine Rich [Ric83]: "Artificial Intelligence is the study of how to make computers do things at which, at the moment, people are better", and: humans learn better than computers
⇒ machine learning is very important for AI.

Why machine learning?

- complexity of software development
- behaviour of an autonomous robot
- expert systems
- spam filters

Solution: hybrid software with programmed and learned components.

What is learning?

- learning the vocabulary of a foreign language?
- memorizing a poem?
- acquisition of mathematical skills?
- learning to ski?
Sorting device for apples

Features: size and color
Classification task (classifier): sort apples into the merchandise classes A and B.

Size [cm]                   8     8     6     3    ...
Color [0: green, 1: red]   0.1   0.3   0.9   0.8   ...
Merchandise class           B     A     A     B    ...

Training data for the apple sorting agent.
[Figure: apple sorting equipment and some apples classified into the merchandise classes A (+) and B (−) in the feature space spanned by color and size]

[Figure: a curve in feature space divides the two classes]

In practice: 30 or more features!

With n features, an (n − 1)-dimensional hyperplane has to be found in the n-dimensional feature space which divides the two classes as well as possible.
Terms

Classifier: maps a feature vector to a class value out of a fixed set of alternatives. Example: sorting apples.
Approximation: maps a feature vector to a real number. Example: forecasting share prices from given feature values.

The learning agent

[Figure: the apple sorting agent maps color and size to a merchandise class; it learns from classified apples]

[Figure: the learning agent in general maps features 1, ..., n to a class label or function value; it learns from training data]
Definition of machine learning

"Machine learning is the study of computer algorithms that improve automatically through experience." [Mit97]

Drawing on this, we define:

Definition: An agent is a learning agent if it improves its performance (measured by a suitable criterion) on new, unknown data over time (after it has seen many training examples).

Terms (apple sorting example):

Task: learning the mapping from size and color of an apple to a merchandise class.
Performance measure: the number of correctly classified apples.
Variable agent (more precisely, a class of agents): the learning algorithm determines the class of all learnable functions.
Training data (experience): the training data contain the knowledge which the learning algorithm is supposed to extract.
Test data: evaluate whether the trained agent generalizes well from the training data to new data.
Data mining

What is data mining?
- extract knowledge from training data
- make the knowledge understandable to humans
- example: decision tree induction
- knowledge management: analyzing customer preferences, e.g. in e-shops; selective advertising

Definition: The process of acquiring knowledge from data, as well as its representation and application, is called data mining.

Literature: I. Witten and E. Frank. Data Mining. Hanser Verlag München, 2001. The data mining program library WEKA, developed in Java by the authors: www.cs.waikato.ac.nz/~ml/weka.
Data analysis

LEXMED data, N = 473 patients:

Var. no.   Description                        Values
 1         age                                continuous
 2         sex (1 = male, 2 = female)         1, 2
 3         pain quadrant 1                    0, 1
 4         pain quadrant 2                    0, 1
 5         pain quadrant 3                    0, 1
 6         pain quadrant 4                    0, 1
 7         local muscular guarding            0, 1
 8         generalized muscular guarding      0, 1
 9         rebound tenderness                 0, 1
10         pain on tapping                    0, 1
11         pain during rectal examination     0, 1
12         axial temperature                  continuous
13         rectal temperature                 continuous
14         leucocytes                         continuous
15         diabetes mellitus                  0, 1
16         appendicitis                       0, 1

Two example patients:

x1 = (26, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 37.9, 38.8, 23100, 0, 1)
x2 = (17, 2, 0, 0, 1, 0, 1, 0, 1, 1, 0, 36.9, 37.4, 8100, 0, 0)

Mean:

x̄i := (1/N) Σ_{p=1}^{N} xi^p

Standard deviation:

si := sqrt( (1/(N − 1)) Σ_{p=1}^{N} (xi^p − x̄i)² )

Covariance:

σij = (1/(N − 1)) Σ_{p=1}^{N} (xi^p − x̄i)(xj^p − x̄j)

Correlation coefficient:

Kij = σij / (si · sj)

[Table: the 16 × 16 correlation matrix for the appendicitis variables measured in 473 cases]

Correlation matrix as density plot:
- left: Kij = −1 black, Kij = 1 white
- right: |Kij| = 0 black, |Kij| = 1 white

[Figure: the two density plots]
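With NumPy, mean, covariance and the correlation matrix follow directly from the formulas above. A sketch (our own code) on random placeholder data of the same shape; the real 473 × 16 LEXMED data are not reproduced here:

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((473, 16))                # placeholder for the patient data

mean = X.mean(axis=0)                                  # means x_bar_i
Sigma = (X - mean).T @ (X - mean) / (len(X) - 1)       # covariances sigma_ij
s = np.sqrt(np.diag(Sigma))                            # standard deviations s_i
K = Sigma / np.outer(s, s)                             # K_ij = sigma_ij/(s_i s_j)

assert np.allclose(K, np.corrcoef(X, rowvar=False))    # matches NumPy's builtin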
The Perceptron, a Linear Classifier

- linearly separable two-dimensional data set
- equation of the dividing line: a1 x1 + a2 x2 = 1

[Figure: a linearly separable data set; the dividing line intersects the axes at 1/a1 and 1/a2]

Linear separability

An (n − 1)-dimensional hyperplane in R^n is given by

Σ_{i=1}^{n} ai xi = θ,

thus we define:

Definition: Two sets M1 ⊂ R^n and M2 ⊂ R^n are called linearly separable if real numbers a1, ..., an, θ exist with

Σ_{i=1}^{n} ai xi > θ for all x ∈ M1   and   Σ_{i=1}^{n} ai xi ≤ θ for all x ∈ M2.

The value θ is called the threshold.
Linear separability

[Figure: AND is linearly separable, e.g. by the line x2 = −x1 + 3/2, but XOR is not (• = true, ◦ = false)]
Perceptron

[Figure: a perceptron with inputs x1, ..., xn, weights w1, ..., wn and output P(x)]

Definition: Let w = (w1, ..., wn) ∈ R^n be a weight vector and x ∈ R^n an input vector. A perceptron represents a function P: R^n → {0, 1} with

P(x) = 1 if w·x = Σ_{i=1}^{n} wi xi > 0, else 0.

- directed two-layer neural network
- the input variables xi are called features
- the separating hyperplane passes through the origin
- points x above the hyperplane Σ_{i=1}^{n} wi xi = 0 are classified as positive (P(x) = 1)
The learning rule

M+: set of positive training patterns, M−: set of negative training patterns (20):

PerceptronLearning(M+, M−)
  w = arbitrary vector of real numbers
  Repeat
    For all x ∈ M+
      If w·x ≤ 0 Then w = w + x
    For all x ∈ M−
      If w·x > 0 Then w = w − x
  Until all x ∈ M+ ∪ M− are correctly classified

20 M. Minsky and S. Papert. Perceptrons. MIT Press, Cambridge, MA, 1969, p. 164.
Why does the update rule point in the right direction? (21)

(w + x) · x = w·x + x²  ⇒  w·x eventually becomes positive.
(w − x) · x = w·x − x²  ⇒  w·x eventually becomes negative.

21 Caution! This is not a proof of convergence!

Example

- M+ = {(0, 1.8), (2, 0.6)}, M− = {(−1.2, 1.4), (0.4, −1)}
- initial weight vector: w = (1, 1)
- dividing line: w·x = x1 + x2 = 0

(−1.2, 1.4) is falsely classified, because (−1.2, 1.4) · (1, 1) = 0.2 > 0.
Thus w = (1, 1) − (−1.2, 1.4) = (2.2, −0.4).
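The learning rule is only a few lines of Python. The sketch below is our own translation of the pseudocode; max_epochs is an added safeguard for non-separable data. Its first update on the example data reproduces the step above, w = (2.2, −0.4).

import numpy as np

def perceptron_learning(M_plus, M_minus, w, max_epochs=1000):
    for _ in range(max_epochs):
        changed = False
        for x in M_plus:               # positive pattern on the wrong side
            if np.dot(w, x) <= 0:
                w = w + x
                changed = True
        for x in M_minus:              # negative pattern on the wrong side
            if np.dot(w, x) > 0:
                w = w - x
                changed = True
        if not changed:                # all patterns correctly classified
            return w
    return w                           # may not converge if not separable

M_plus  = [np.array([0, 1.8]), np.array([2, 0.6])]
M_minus = [np.array([-1.2, 1.4]), np.array([0.4, -1])]
w = perceptron_learning(M_plus, M_minus, w=np.array([1.0, 1.0]))
print(w)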
Convergence

Theorem: Let the classes M+ and M− be linearly separable by a hyperplane w·x = 0. Then PerceptronLearning converges for every initialization of the vector w. The perceptron P with the weight vector so calculated divides the classes M+ and M−, that is:

P(x) = 1 ⇔ x ∈ M+   and   P(x) = 0 ⇔ x ∈ M−.
Linear separability

- the perceptron cannot divide arbitrary linearly separable classes
- in R² the dividing line passes through the origin
- in R^n the hyperplane passes through the origin, because Σ_{i=1}^{n} wi xi = 0

Trick (bias unit): set xn := 1; the weight wn =: −θ then acts as a threshold, because

Σ_{i=1}^{n} wi xi = Σ_{i=1}^{n−1} wi xi − θ > 0  ⇔  Σ_{i=1}^{n−1} wi xi > θ.

- Add a component with constant value 1 to every training vector!
- The weight wn (threshold θ) is then learned too!

A perceptron Pθ: R^{n−1} → {0, 1} with

Pθ(x1, ..., xn−1) = 1 if Σ_{i=1}^{n−1} wi xi > θ, else 0    (33)

with an arbitrary threshold can thus be simulated by a perceptron P: R^n → {0, 1} with threshold 0.
Theorem: A function f: R^n → {0, 1} can be represented by a perceptron if and only if the two sets of positive and negative input vectors are linearly separable.

Proof: Compare Equation (33) with the definition of linear separability.
Example

[Figure: 5 × 5 binary patterns: positive training examples (M+), negative training examples (M−), and test patterns]

- the training data are learned in 4 iterations over all patterns
- evaluating the generalization capability: noisy patterns with a variable number of inverted (consecutive) bits

[Figure: relative correctness of the perceptron depending on the number of inverted bits in the training data]
Optimization and outlook

- slow convergence
- acceleration by normalization of the weight change vectors: w = w ± x/|x|
- better initialization of the vector w: w0 = Σ_{x ∈ M+} x − Σ_{x ∈ M−} x (see Exercise)
- multilayer networks, e.g. backpropagation, are more powerful (Sec. )
- backpropagation can divide classes which are not linearly separable
The Nearest Neighbour Method

- rote learning with generalization
- a physician memorizes cases; for a diagnosis he 1. remembers similar cases from the past and 2. makes the same or a similar diagnosis
- difficulties: the physician must have a good sense for similarity, and: is the found case similar enough?

What does similarity mean? The smaller their distance in feature space, the more similar two examples are.

[Figure: a two-dimensional data set; the point marked black is classified negative because its nearest neighbour is negative]

Distance

Distance d(x, y) between two points: the metric given by the Euclidean norm:

d(x, y) = |x − y| = sqrt( Σ_{i=1}^{n} (xi − yi)² ).

With weighted features:

dw(x, y) = sqrt( Σ_{i=1}^{n} wi (xi − yi)² ).
Algorithm

NearestNeighbour(M+, M−, s)
  t = argmin_{x ∈ M+ ∪ M−} {d(s, x)}
  If t ∈ M+ Then Return(+) Else Return(−)

Separating classes (virtually)

[Figure: Voronoi diagram (left) and the resulting dividing line (right)]
- The nearest neighbour method is more powerful than the perceptron.
- Arbitrarily complex dividing lines (hypersurfaces) are possible.

[Figure: a data set with an outlier; the point marked black is classified wrongly]

- How to deal with outliers?
- wrong adaptation to random errors (noise)
- overfitting!

Example

We apply NearestNeighbour to the pattern example above (Example ??), with the Hamming distance as metric (22).

22 The Hamming distance between two bit vectors is the number of differing bits.
[Figure: correctness of the nearest neighbour method depending on the number of inverted bits (grey: perceptron)]
k-nearest neighbour method

Majority decision among the k nearest neighbours:

k-NearestNeighbour(M+, M−, s)
  V = {k nearest neighbours in M+ ∪ M−}
  If |M+ ∩ V| > |M− ∩ V| Then Return(+)
  ElseIf |M+ ∩ V| < |M− ∩ V| Then Return(−)
  Else Return(Random(+, −))
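A direct Python translation (our own sketch, with the Euclidean distance and made-up example data):

import numpy as np

def k_nearest_neighbour(M_plus, M_minus, s, k):
    # Distances to all training points, labelled with their class
    data = [(np.linalg.norm(np.subtract(x, s)), "+") for x in M_plus] \
         + [(np.linalg.norm(np.subtract(x, s)), "-") for x in M_minus]
    data.sort(key=lambda pair: pair[0])
    votes = [label for _, label in data[:k]]            # the k nearest
    if votes.count("+") > votes.count("-"):
        return "+"
    if votes.count("+") < votes.count("-"):
        return "-"
    return np.random.default_rng().choice(["+", "-"])   # tie: random class

M_plus  = [(1, 1), (2, 1.5), (1.5, 2)]
M_minus = [(4, 4), (5, 3.5), (4.5, 5)]
print(k_nearest_neighbour(M_plus, M_minus, s=(2, 2), k=3))   # -> "+"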
More than two classes

- nearest neighbour classification for more than two classes: the class of the nearest neighbour
- k-nearest-neighbour method: the class with the majority among the k nearest neighbours
- use the k-nearest-neighbour method only with few classes, up to about 10
Example

An autonomous robot with simple sensors learns to avoid light.

- sensor signals sl from the left and sr from the right sensor, x = sr/sl
- based on x, the difference v = Ur − Ul of the voltages at the two motors has to be determined
- the agent has to find a mapping f that calculates the correct value v = f(x) for any x

[Figure: difference of motor voltages v over the ratio of sensor inputs x; training data with a learned classification (left) and an approximated function (right)]
Approximation methods

- polynomial interpolation, spline interpolation, least squares
- problematic in higher dimensions
- difficulty: in AI we need model-free approximation methods
- neural networks, support vector machines, Gaussian processes, ..., nearest neighbour

k-nn for approximation problems

1. Determine the set V = {x1, x2, ..., xk} of the k nearest neighbours of x.
2. Average the function values:

f̂(x) = (1/k) Σ_{i=1}^{k} f(xi)    (34)

The larger k becomes, the smoother the function f̂ is.
The distance is relevant

For large k:
- many neighbours with large distance
- few neighbours with small distance
- the calculation of f̂ is dominated by far-away neighbours

Solution: in k-NearestNeighbour the votes are weighted:

wi = 1/(1 + α d(x, xi)²)    (35)

For approximation, Equation (34) is replaced by

f̂(x) = Σ_{i=1}^{k} wi f(xi) / Σ_{i=1}^{k} wi.

- The influence of points approaches zero with increasing distance.
- All training data can then be used for classification/approximation.
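A sketch of the distance-weighted approximation (our own code; a one-dimensional feature and artificial noisy samples of a sine function serve as assumed data):

import numpy as np

def knn_approximate(X, f, x, k, alpha):
    # f_hat(x) = sum(w_i f(x_i)) / sum(w_i), w_i = 1/(1 + alpha d(x,x_i)^2)
    d = np.abs(X - x)
    idx = np.argsort(d)[:k]           # indices of the k nearest neighbours
    w = 1.0 / (1.0 + alpha * d[idx] ** 2)
    return np.sum(w * f[idx]) / np.sum(w)

rng = np.random.default_rng(1)
X = np.linspace(0, 2, 40)
f = np.sin(3 * X) + 0.1 * rng.standard_normal(40)   # noisy training data
print(round(knn_approximate(X, f, x=0.5, k=6, alpha=4.0), 2))
# smoothed estimate near sin(1.5), i.e. roughly 1.0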
k-nn with/without distance weighting

[Figure: k-nearest-neighbour approximation for k = 1, 2, 6 (top row) and nearest neighbour with distance weighting for α = 20, 4, 1 (bottom row)]

Time complexity

- learning consists of simply memorizing; therefore, learning is very fast
- but classification/approximation of a vector x is expensive:
- finding the k nearest neighbours among n training data: Θ(n)
- computing the classification/approximation: Θ(k)
- total computing time: Θ(n + k)
- nearest neighbour methods = lazy learning

Example: avalanche forecast

Task: the avalanche hazard level is to be determined from the amount of fresh snow.

[Figure: avalanche hazard level over the 3-day new snow sum in cm; a lazy-learning fit follows the data locally, an eager-learning fit is a global smooth curve]
Nearest neighbour methods: fields of application

Nearest neighbour methods are well suited for problems which need a good local approximation but which do not place high requirements on the speed of the system.

Nearest neighbour methods are not suited when a description of the knowledge extracted from the data is required which is understandable by humans (data mining).
Case-based reasoning (CBR)

- extension of nearest neighbour methods to symbolic problem descriptions and their solutions [Ric03]
- fields of application: diagnosis of technical problems; telephone hotlines, second-level support

CBR example

Feature                  Query                  Case from case base
Defective part           Rear light             Front light
Bicycle model            Marin Pine Mountain    VSF T400
Year                     1993                   2001
Power source             Battery                Dynamo
Bulb condition           ok                     ok
Light cable condition    ?                      ok
Diagnosis                ?                      Front electrical contact missing
Repair                   ?                      Establish front electrical contact

Transformation: the rear light is mapped to the front light.

CBR schema

[Figure: a query x is transformed into a known case y; the solution for y is transformed back into a solution for x]

CBR: problems

Modeling: Can the developer predict and map all possible special cases and problem variants?
Similarity: A suitable similarity metric for symbolic features is needed.
Transformation: How to find the transformation mapping and its inverse?

Solutions:
- an interesting alternative to CBR: Bayesian networks
- symbolic problem representations can often be mapped onto numeric features
- then other methods are applicable
Decision tree learning

[Figure: a decision tree for the skiing classification problem with root node Snow_Dist (≤ 100 / > 100)]

Variable    Values        Description
Ski         yes, no       Should I drive to the nearest ski resort with enough snow?
Sun         yes, no       Is there sunshine today?
Snow_Dist   ≤100, >100    Distance to the nearest ski resort with good snow conditions (over/under 100 km)
Weekend     yes, no       Is it the weekend today?

Variables for the skiing classification problem.
Training data

Day   Snow_Dist   Weekend   Sun   Skiing
 1      ≤100        yes     yes    yes
 2      ≤100        yes     yes    yes
 3      ≤100        yes     no     yes
 4      ≤100        no      yes    yes
 5      >100        yes     yes    yes
 6      >100        yes     yes    yes
 7      >100        yes     yes    no
 8      >100        yes     no     no
 9      >100        no      yes    no
10      >100        no      yes    no
11      >100        no      no     no

Rows 6 and 7 contradict each other! Thus the tree shown above is optimal.
How does the tree evolve from the data?

Optimal algorithm: create all trees, test them on the data, and finally choose the tree with the minimal error.
Disadvantage: unacceptably large computing time!
Thus: a heuristic algorithm with a greedy strategy.
Entropy as a metric for information content
- The attribute with the highest information gain shall be chosen.
- Training data set S = (yes, yes, yes, yes, yes, yes, no, no, no, no, no) with the estimated probabilities
  p_1 = p(yes) = 6/11 and p_2 = p(no) = 5/11.
- Probability distribution: p = (p_1, p_2) = (6/11, 5/11),
- in general p = (p_1, ..., p_n) with Σ_{i=1}^{n} p_i = 1.
Two extreme cases

Sure event:
  p = (1, 0, 0, ..., 0).    (36)

Uniform distribution:
  p = (1/n, 1/n, ..., 1/n).    (37)

Here the uncertainty is maximal!
Claude Shannon: how many bits would be needed minimally to encode such an event?
1. case (Equation (36)): zero bits
2. case (Equation (37)): log_2 n bits

General case:
  p = (p_1, ..., p_n) = (1/m_1, 1/m_2, ..., 1/m_n)
- log_2 m_i bits are required for the i-th case.
- Expected value H for the number of bits:

  H = Σ_{i=1}^{n} p_i log_2 m_i = Σ_{i=1}^{n} p_i log_2(1/p_i) = −Σ_{i=1}^{n} p_i log_2 p_i.    (38)
The more bits we need to encode an event, the higher the uncertainty about the outcome. Thus: entropy H as a metric for the uncertainty of a probability distribution:

  H(p) = H(p_1, ..., p_n) := −Σ_{i=1}^{n} p_i log_2 p_i.

Problem: 0 log_2 0 is undefined! Definition: 0 log_2 0 := 0 (see Exercise ??). Thus

  H(1, 0, ..., 0) = 0
and
  H(1/n, ..., 1/n) = log_2 n

is the unique global maximum of the entropy!
Special case: two classes

  H(p) = H(p_1, p_2) = H(p_1, 1−p_1) = −(p_1 log_2 p_1 + (1−p_1) log_2(1−p_1))

[Figure: the entropy function H(p, 1−p) for the case of two classes, with its maximum of 1 at p = 0.5.]
Data set D with probability distribution p: H(D) = H(p).
The information content I(D) of the data set D is the complement of its uncertainty, thus

  I(D) := 1 − H(D).    (39)
The information gain

If we now apply the entropy formula to the example, the result is H(6/11, 5/11) = 0.994.

Information gain:

  G(D, A) = Σ_{i=1}^{n} (|D_i|/|D|) I(D_i) − I(D)

With Equation (39) we obtain

  G(D, A) = Σ_{i=1}^{n} (|D_i|/|D|) I(D_i) − I(D)
          = Σ_{i=1}^{n} (|D_i|/|D|) (1 − H(D_i)) − (1 − H(D))
          = 1 − Σ_{i=1}^{n} (|D_i|/|D|) H(D_i) − 1 + H(D)
          = H(D) − Σ_{i=1}^{n} (|D_i|/|D|) H(D_i)
  G(D, Snow_Dist) = H(D) − (4/11 · H(D_{≤100}) + 7/11 · H(D_{>100}))
                  = 0.994 − (4/11 · 0 + 7/11 · 0.863) = 0.445

Analogously we obtain G(D, Weekend) = 0.150 and G(D, Sun) = 0.049.
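These numbers can be checked with a few lines of Python; a sketch, using the skiing table above (function names are illustrative):

    import numpy as np

    def entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -(p * np.log2(p)).sum()

    skiing    = np.array(['yes'] * 6 + ['no'] * 5)       # class column, days 1-11
    snow_dist = np.array(['<=100'] * 4 + ['>100'] * 7)   # attribute Snow_Dist

    H_D = entropy(skiing)    # 0.994
    gain = H_D - sum((snow_dist == v).mean() * entropy(skiing[snow_dist == v])
                     for v in np.unique(snow_dist))      # 0.445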
Thus: Snow_Dist becomes the root node.

[Figure: comparison of the three candidate root attributes, each splitting the class counts [6,5]: Snow_Dist (≤ 100 / > 100, gain 0.445), Weekend (yes/no, gain 0.150), Sun (yes/no, gain 0.049).]
Data set for the skiing classification problem (the training-data table shown above). For D_{≤100}, the classification is clearly yes; thus the tree terminates in this branch.
In the other branch, D_{>100}, there is no clear result. We compute

  G(D_{>100}, Weekend) = 0.292
  G(D_{>100}, Sun) = 0.170

Thus the next node contains the attribute Weekend. For Weekend = no, the tree terminates with the decision no. For Weekend = yes, Sun results in a gain of 0.171. The construction of the tree then terminates because no further attributes are available.
[Figure: the finished decision tree for the skiing problem over the days (1, ..., 11): root Snow_Dist; the branch ≤ 100 ends in yes, the branch > 100 continues with Weekend and, for Weekend = yes, with Sun.]
Information about attributes and classes in the file ski.names:

  |Classes:  no: do not ski,  yes: go skiing
  no,yes.
  |Attributes
  Snow_Dist: <=100, >100.
  Weekend:   no,yes.
  Sun:       no,yes.
unixprompt> c4.5 -f ski -m 1

  C4.5 [release 8] decision tree generator    Wed Aug 23
  ----------------------------------------
  Options:
      File stem <ski>
      Sensible test requires 2 branches with >=1 cases
  Read 11 cases (3 attributes) from ski.data

  Decision Tree:
  Snow_Dist = <=100: yes (4.0)
  Snow_Dist = >100:
  |   Weekend = no: no (3.0)
  |   Weekend = yes:
  |   |   Sun = no: no (1.0)
  |   |   Sun = yes: yes (3.0/1.0)

  Simplified Decision Tree:
  Snow_Dist = <=100: yes
  Snow_Dist = >100: no (7.0/3.4)

  Evaluation on training data (11 items):

       Before Pruning            After Pruning
       ----------------    ---------------------------
       Size      Errors    Size      Errors   Estimate
          7    1 ( 9.1%)      3    2 (18.2%)   (41.7%)
The same procedure applied to the appendicitis data:

unixprompt> c4.5 -f app -u -m 100

  C4.5 [release 8] decision tree generator    Wed Aug 23 13:13:15 2006
  ----------------------------------------
  Read 9764 cases (15 attributes) from app.data

  Decision Tree:
  [C4.5 tree on the attributes Leukocytes and Temp_rectal with thresholds such as 11030, 8600, 381 and 378; the comparison operators were lost in extraction.]

The link-analysis algorithm (Octave; only the tail of the function survived extraction):

      ...
      while delta > 10^(-3)
        CRold = CR;
        PR = A' * CR;
        CR = B * PR;
        colsums = sum(CR);
        for j = 1:m
          CR(:,j) = CR(:,j)/colsums(j);   % normalize each column of CR
        endfor
        CR = CR + eta * eye(m);
        delta = norm(CR - CRold);
      endwhile
    endfunction
The link-analysis algorithm: result

Interaction matrix A:

        c1  c2  c3  c4  c5  c6  c7  c8  c9  c10
  p1     1   1   1   1   1   0   0   0   0   1
  p2     0   1   0   0   1   0   1   1   1   1
  p3     1   1   1   1   1   0   1   1   0   0
  p4     0   0   1   1   0   0   0   0   0   1
  p5     0   0   0   0   0   1   0   0   1   0
  p6     0   1   0   1   1   0   0   0   1   0

Product representativeness PR^T:

        c1    c2    c3    c4    c5    c6    c7    c8    c9    c10
  p1   1.73  1.66  1.76  1.75  1.66  0.48  0.59  0.59  0.49  1.71
  p2   0.59  1.67  0.56  0.58  1.67  0.56  1.72  1.72  1.70  1.66
  p3   1.88  1.82  1.86  1.84  1.82  0.60  1.84  1.84  0.64  0.78
  p4   0.33  0.28  1.39  1.37  0.28  0.22  0.25  0.25  0.21  1.37
  p5   0.04  0.09  0.03  0.06  0.09  1.33  0.08  0.08  1.27  0.07
  p6   0.37  1.43  0.37  1.43  1.43  0.38  0.37  0.37  1.46  0.39
Comparison: user-based vs. link-analysis

User-based collaborative filtering scores W_C · A:

        c1    c2    c3    c4    c5    c6    c7    c8    c9    c10
  p1   4.60  4.38  4.37  4.35  4.38  1.45  3.50  3.50  1.63  3.67
  p2   3.40  4.50  2.67  2.65  4.50  2.05  4.50  4.50  3.33  3.33
  p3   5.35  5.38  4.71  4.60  5.38  1.85  5.00  5.00  2.29  3.67
  p4   2.10  1.62  2.52  2.35  1.62  0.74  1.38  1.38  0.48  2.17
  p5   0.49  0.68  0.28  0.33  0.68  1.72  0.84  0.84  1.72  0.61
  p6   2.40  3.12  1.85  2.40  3.12  1.27  2.38  2.38  2.15  1.83

Product representativeness PR^T: as in the table above.
Content-based filtering
- Products are recommended based on their own properties rather than on other users' ratings.
- Product description via a set of features x_1, ..., x_k.
- Goal: find a mapping from product features to a score.
- Sort the products with respect to their score.
Example: the system yakadoo

[Screenshots: the recommender system yakadoo asks the customer a sequence of questions.]

- Expert system for customer service.
- Replaces the salesperson.
- Goal: recommend a product that optimally matches the customer's requirements.
- Solution 1: nearest neighbour method.
Nearest neighbour filtering
- Target feature vector t = (t_1, ..., t_k) from the customer interview.
- Find the most similar product in the database containing the items(31) x_1, ..., x_n:

    recommended product = argmin_{i=1,...,n} d(t, x_i)

- Euclidean distance:

    d(t, x) = Σ_{i=1}^{k} (t_i − x_i)².

(31) More precisely: x_1, ..., x_n are the feature vectors of the items.
Weighted nearest neighbour filtering
- Weighted Euclidean distance:

    d(t, x) = Σ_{i=1}^{k} w_i (t_i − x_i)².

- How to find the weight vector (w_1, ..., w_k)?
  - manually, or
  - by minimizing the number of misclassifications with respect to the weight vector (a sketch follows below).
Part IX

Neural Networks
From Biology to Simulation
Hopfield Networks
Neural Associative Memory
Linear Networks with Minimal Error
The Backpropagation Algorithm
Support Vector Machines
Neural networks
- The human brain has between 10 and 100 billion nerve cells.
- Around the year 1900, biologists started to understand neurons.
- 1943, McCulloch and Pitts: "A logical calculus of the ideas immanent in nervous activity" (in [AR88]).
- A bionics branch within AI.
- Every neuron is connected to 1000-10,000 other neurons; approximately 10^14 connections in total.
- A neural impulse lasts approximately one millisecond; thus the firing rate of a neuron is below one kilohertz.
Biology and simulation

[Figure: biological neuron with cell body, axon, dendrites and synapses, next to its simulated counterpart.]
The mathematical model

Charging of the activation potential by summation of the weighted output values of the input connections x_1, ..., x_n:

  Σ_{j=1}^{n} w_ij x_j.

Neuron i computes its output as

  x_i = f( Σ_{j=1}^{n} w_ij x_j )

with activation function f.

Threshold function (Heaviside step function):

  f(x) = H_Θ(x) = { 0 if x < Θ,  1 else }.

The neuron computes its output by

  x_i = { 0 if Σ_{j=1}^{n} w_ij x_j < Θ,  1 else }.

This equation is identical to a perceptron with threshold Θ.

[Figure: neuron x_i with weights w_1, ..., w_n on the inputs x_1, ..., x_n.]
The step function is discontinuous. Alternative: the sigmoid function

  f(x) = 1 / (1 + e^{−(x−Θ)/T}).

[Figure: sigmoid functions for T = 0.01, 0.4, 1, 2, 5, 100; for small T the sigmoid approaches the step function at Θ.]
Modeling learning

D. Hebb, 1949 (Hebb rule): If there is a connection w_ij between neuron j and neuron i, and repeated signals from neuron j to neuron i lead to a parallel activation of both neurons, then the weight w_ij is amplified. A possible formula for the weight change Δw_ij is

  Δw_ij = η x_i x_j

with a constant η (learning rate) that determines the size of a learning step.
Hopfield networks
- With the Hebb rule, the weights can only grow; weakening is not possible.
- Weakening can be modeled with a constant decay parameter that shrinks unused weights at each time step (e.g. by the factor 0.99).

Another solution (Hopfield 1982):
- binary neurons with values ±1
- Δw_ij = η x_i x_j
- When does Δw_ij become positive / negative?
Autoassociative memory
- In an autoassociative memory, patterns can be memorized.
- Recall: input of similar patterns.
- Classical application: recognition of characters.

Learning phase: N binary coded patterns q^1, ..., q^N shall be learned, with q_j^k ∈ {−1, 1} one pixel per pattern.
- n pixels: a neural network consisting of n neurons
- the neurons are completely connected
- the weight matrix is symmetric
- diagonal elements w_ii = 0
[Figure: recurrent connection of two neurons x_i and x_j via the weights w_ij and w_ji.]

Learning:

  w_ij = (1/n) Σ_{k=1}^{N} q_i^k q_j^k.    (43)

Note the relation to the Hebb rule!
Pattern recognition

Input a new pattern x and update all neurons by

  x_i = { −1 if Σ_{j≠i} w_ij x_j < 0,  1 else }    (44)

until the network becomes stable.

Programming scheme:

  HopfieldAssociator(q)
    Initialize all neurons: x = q
    Repeat
      i = Random(1, n)
      Update neuron i according to Equation (44)
    Until x converges
    Return(x)
Application to a pattern-recognition example
- digits in a 10 × 10 pixel field
- the Hopfield network consists of 100 neurons,
- in total 100 · 99 / 2 = 4950 weights

[Figure: the four patterns learned by the network.]
[Figure: recall of the learned digit patterns from inputs with 10% noise; the network converges after 62 to 379 steps. Also shown: a few stable states of the network that were not learned.]

[Figure: the ten learned patterns, six noisy patterns with 10% noise, and the stable states reached from the noisy patterns.]
Analysis

neurons ↔ elementary magnets

[Figure: neural and physical interpretation of the Hopfield model: two neurons x_i, x_j with symmetric coupling w_ij = w_ji, analogous to two coupled spins ±1.]

Total energy of the system:

  E = −(1/2) Σ_{i,j} w_ij x_i x_j.

The same holds in physics: w_ii = 0 and w_ij = w_ji.
Hopfield dynamics
- A physical system in equilibrium minimizes its energy E.
- The system moves to a state of minimum energy.
- In every step, the Hopfield dynamics updates the state of one neuron towards minimum total energy.

Contribution of neuron i to the total energy:

  −(1/2) x_i Σ_{j≠i} w_ij x_j.

If Σ_{j≠i} w_ij x_j < 0, then x_i = −1 gives a negative contribution to the total energy, whereas x_i = 1 gives a positive one.
Analogously,

  Σ_{j≠i} w_ij x_j ≥ 0

requires x_i = 1.

- The total energy of the system decreases monotonically over time.
- The network moves to a state of minimum energy.
- What is the significance of the minima of the energy function?
- The learned patterns are minima of the energy function.
- If too many patterns are learned, the system converges to minima that are not the learned patterns.
- Change from order to chaos: a phase transition.
Physics:
- melting of an ice crystal
- inside the crystal: a state of high order
- in liquid water, the ordered structure of the molecules is lost

Neural network:
- phase transition
- ordered: learning and recognition of patterns
- chaotic: learning with too many patterns
- effects that we can observe in ourselves
Phase transition in Hopfield networks

Assume the neurons are in the pattern state q^1. Into Σ_{j≠i} w_ij q_j^1 we insert the learned weights from Equation (43):

  Σ_{j≠i} w_ij q_j^1 = (1/n) Σ_{j≠i} Σ_{k=1}^{N} q_i^k q_j^k q_j^1
                     = (1/n) Σ_{j≠i} ( q_i^1 (q_j^1)² + Σ_{k=2}^{N} q_i^k q_j^k q_j^1 )
                     ≈ q_i^1 + (1/n) Σ_{j≠i} Σ_{k=2}^{N} q_i^k q_j^k q_j^1,

i.e. the i-th component of the input pattern plus a sum of (n−1)(N−1) terms. If these terms are statistically independent, the sum can be described by a normally distributed random variable with zero mean and standard deviation

  (1/n) √((n−1)(N−1)) ≈ √((N−1)/n).
- The noise does not disturb recall as long as N ≪ n!
- A detailed computation shows that the critical point is N = 0.146 n.
- Example: with 100 neurons, up to 14 uncorrelated patterns can be memorized.
- The memory capacity is below that of a simple list memory!
- Hopfield networks only work well if the patterns have about 50% 1-bits.
- Workaround: neurons with thresholds.
Summary and outlook
- The Hopfield model triggered a wave of enthusiasm for neural networks in the 1980s.
- Hopfield networks are recurrent.
- Networks without recurrent connections are easier to understand.
- Problem: local minima ⇒ Boltzmann machine, simulated annealing.
- The Hopfield dynamics is also applicable to other energy functions,
- e.g. for the travelling salesman problem.
Neural associative memory
- Phone book: a mapping from names to phone numbers, realized as tables in a database.
- Entrance control to a building using a photograph of the person.
- Problems arise when only photographs are memorized in a database.
- Solution: an associative memory can relate similar photographs to the right name.
- A typical task for machine learning methods.
- Straightforward approach: the nearest neighbour method. Not applicable for entrance control. Why?
Correlation matrix memory

Given: training data with N request-response pairs

  T = {(q^1, t^1), ..., (q^N, t^N)}

Wanted: a matrix W that maps all request vectors correctly to their responses, i.e. for p = 1, ..., N

  t^p = W · q^p   or, respectively,   t_i^p = Σ_{j=1}^{n} w_ij q_j^p.    (45)

[Figure: two-layer network with weights w_11, ..., w_mn mapping the inputs q_1, ..., q_n to the outputs t_1, ..., t_m.]
Computation of the matrix elements w_ij (Hebb rule):

  w_ij = Σ_{p=1}^{N} q_j^p t_i^p    (46)

Definition: Two vectors x and y are called orthonormal if

  x · y = { 1 if x = y,  0 else }.
Theorem: If all request vectors q^p of the training data are orthonormal, then every vector q^p multiplied with the matrix W is mapped to the response vector t^p.

Proof: We insert Equation (46) into Equation (45) and derive

  (W · q^p)_i = Σ_{j=1}^{n} w_ij q_j^p
             = Σ_{j=1}^{n} Σ_{r=1}^{N} q_j^r t_i^r q_j^p
             = t_i^p Σ_{j=1}^{n} q_j^p q_j^p + Σ_{r=1, r≠p}^{N} t_i^r Σ_{j=1}^{n} q_j^r q_j^p
             = t_i^p,

since Σ_j q_j^p q_j^p = 1 and Σ_j q_j^r q_j^p = 0 for r ≠ p.

Problem: suppose the name Hans is the correct output for a certain face. Possible outputs when inputting similar faces: e.g. Gans or Hbns.
The pseudoinverse

Request vectors as columns of an n × N matrix Q = (q^1, ..., q^N), response vectors as columns of an m × N matrix T = (t^1, ..., t^N):

  T = W · Q    (47)

Solving the equation for W:

  W = T · Q^{−1}.    (48)

How to invert a non-invertible matrix? A matrix Q is invertible if there is a matrix Q^{−1} with the property

  Q · Q^{−1} = I.    (49)

Wanted: a matrix that comes as close as possible to this property.

Definition: Let Q be a real n × m matrix. An m × n matrix Q^+ is called a pseudoinverse of Q if it minimizes ||Q · Q^+ − I||. Here ||M|| is the quadratic norm, i.e. the sum over the squares of all matrix elements of M.

So:

  W = T · Q^+    (50)

The weight matrix W minimizes the associative error (crosstalk) ||T − W · Q||.

- The search for the pseudoinverse is not easy.
- Alternative: the backpropagation algorithm.
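Numerically, however, the pseudoinverse is readily available, e.g. via the SVD in NumPy. A sketch of Equation (50); the toy matrices are illustrative:

    import numpy as np

    # columns of Q are the request vectors, columns of T the responses (toy data)
    Q = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # n x N
    T = np.array([[1.0, 0.0], [0.0, 1.0]])               # m x N

    W = T @ np.linalg.pinv(Q)   # Equation (50): minimizes the crosstalk ||T - W Q||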
The binary Hebb rule (Palm model)

Patterns are binary encoded: q^p ∈ {0, 1}^n and t^p ∈ {0, 1}^m.

[Figure: binary weight matrix for n = 10, m = 6, built from three pattern pairs (q^1, t^1), (q^2, t^2), (q^3, t^3).]

Recall of the memorized patterns:

[Figure: computation of the products W q^1, W q^2, W q^3; the resulting column sums are compared against a threshold.]

Binary Hebb rule:

  w_ij = ⋁_{p=1}^{N} q_j^p t_i^p.    (51)
Memory capacity
- The weight matrix must be sparse.
- The matrix has m·n elements; a pair to be memorized has m + n bits. For the number N_max of memorizable patterns:(32)

  α = (number of memorizable bits) / (number of memory cells) = (m + n) N_max / (m n) ≤ ln 2 ≈ 0.69.    (52)

  Memory model                               α
  List memory                                1
  Associative memory with binary Hebb rule   0.69
  Kohonen associative memory                 0.72
  Hopfield networks                          0.292

(32) G. Palm. "On Associative Memory". In: Biological Cybernetics 36 (1980), pp. 19-31.
An error-correction program

Coding of the request vectors q: the request vector consists of 676 bits, one for every ordered pair of letters aa, ab, ..., az, ba, ..., bz, ..., za, ..., zz (26 · 26 = 676 pairs). Example: for hans, the combinations ha, an and ns are set to 1.

In the response vector t, 26 bits are reserved for every position in the word (word length at most 10), i.e. 260 bits. For hans, the bits 8, 27, 66 and 97 are set.

The weight matrix W has the size 676 · 260 bit = 175760 bit. Memory capacity:

  N_max ≤ 0.69 · (m n)/(m + n) = 0.69 · (676 · 260)/(676 + 260) ≈ 130 words
Memorized names: agathe, agnes, alexander, andreas, andree, anna, annemarie, astrid, august, bernhard, bjorn, cathrin, christian, christoph, corinna, corrado, dieter, elisabeth, elvira, erdmut, ernst, evelyn, fabrizio, frank, franz, geoffrey, georg, gerhard, hannelore, harry, herbert, ingilt, irmgard, jan, johannes, johnny, juergen, karin, klaus, ludwig, luise, manfred, maria, mark, markus, marleen, martin, matthias, norbert, otto, patricia, peter, phillip, quit, reinhold, renate, robert, robin, sabine, sebastian, stefan, stephan, sylvie, ulrich, ulrike, ute, uwe, werner, wolfgang, xavier
Associations of the program:

  Insert pattern: harry
  Threshold: 4, Response: harry
  Threshold: 3, Response: harry
  Threshold: 2, Response: horryrrde
  -------------------------------
  Insert pattern: andrees
  Threshold: 6, Response: a
  Threshold: 5, Response: andree
  Threshold: 4, Response: andrees
  Threshold: 3, Response: mnnrens
  Threshold: 2, Response: morxsnssr
  -------------------------------
  Insert pattern: ute
  ...
Linear networks with minimal error

Idea: learning from errors.

Given: training vectors

  T = {(q^1, t^1), ..., (q^N, t^N)} with q^p ∈ [0,1]^n, t^p ∈ [0,1]^m.

Wanted: a function f : [0,1]^n → [0,1]^m that minimizes the quadratic error

  Σ_{p=1}^{N} (f(q^p) − t^p)²

on the data.

Possible solution:

  f(q) = 0 if q ∉ {q^1, ..., q^N},  and  f(q^p) = t^p for all p ∈ {1, ..., N}.

Why are we not happy with this function?
- Because we want to build an intelligent system!
- It should generalize from the training data to new, unknown data.
- We do not want overfitting.
- We want a smooth function that balances between the points.
- But then we have to further confine the function class.
The method of least squares

Linear two-layered network:

[Figure: a single output neuron y with weights w_1, ..., w_n from the inputs x_1, ..., x_n.]

  y = f( Σ_{i=1}^{n} w_i x_i ),

computed with f(x) = x. A sigmoid activation function would be irrelevant here, because it is strictly monotonically increasing!
Wanted: a vector w that minimizes the error

  E(w) = Σ_{p=1}^{N} (w q^p − t^p)² = Σ_{p=1}^{N} ( Σ_{i=1}^{n} w_i q_i^p − t^p )².

Necessary condition for a minimum of the error function: for j = 1, ..., n we require

  ∂E/∂w_j = 2 Σ_{p=1}^{N} ( Σ_{i=1}^{n} w_i q_i^p − t^p ) q_j^p = 0.

Expanding gives

  Σ_{p=1}^{N} ( Σ_{i=1}^{n} w_i q_i^p q_j^p − t^p q_j^p ) = 0.
Exchanging the sums:

  Σ_{i=1}^{n} w_i Σ_{p=1}^{N} q_i^p q_j^p = Σ_{p=1}^{N} t^p q_j^p.

With

  A_ij = Σ_{p=1}^{N} q_i^p q_j^p   and   b_j = Σ_{p=1}^{N} t^p q_j^p   for i, j = 1, ..., n    (53)

we get the matrix equation (normal equations):

  A w = b.    (54)

- The normal equations have at least one solution, and exactly one if A is invertible.
- A is positive definite, and thus the solution w is a global minimum.

Computing time:
- setting up the matrix: Θ(N · n²)
- solving the system of equations: O(n³)

This method can easily be extended to multiple output neurons (a sketch follows below).
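A sketch of the normal-equation solution in NumPy, with one training vector q^p per row of Q (the function name is illustrative):

    import numpy as np

    def least_squares_weights(Q, t):
        # normal equations (53), (54): A w = b with A = Q^T Q, b = Q^T t
        A = Q.T @ Q
        b = Q.T @ t
        return np.linalg.solve(A, b)   # use np.linalg.lstsq if A is singular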
Application to appendicitis

Linear mapping of the symptoms to the continuous class variable AppScore (the coefficient-variable pairing was partly lost in extraction):

  AppScore = 0.00085·Alter − 0.125·Geschl − 0.025·S4Q + 0.12·Ersch + 0.025·Diab
           + 0.035·AbwGlo + 0.0034·Leuko + 0.031·RektS + 0.0027·TAxi
           + 0.0031·TRek − 0.021·Losl + ... − 1.83

(the remaining terms involve S1Q, S2Q, S3Q and AbwLok with the coefficients 0.081, 0.000021, −0.11 and 0.13).

- AppScore takes continuous values, although the class App is binary!
- Threshold decision!
Error rate

[Figure: error rate over the threshold Θ for least squares on 6640 training data, least squares on 3031 test data, and RProp on 3031 test data. Error from least squares on training and test data.]
The delta rule
- So far: batch learning.
- Now: incremental learning. In every step, the weights are adjusted according to the new training sample: w_j = w_j + Δw_j.
- Error minimization, thus again the partial derivative

  ∂E/∂w_j = 2 Σ_{p=1}^{N} ( Σ_{i=1}^{n} w_i q_i^p − t^p ) q_j^p.

- Gradient:

  ∇E = ( ∂E/∂w_1, ..., ∂E/∂w_n )

- Gradient descent:

  Δw_j = −η ∂E/∂w_j = −η Σ_{p=1}^{N} ( Σ_{i=1}^{n} w_i q_i^p − t^p ) q_j^p.

Replacing the activation

  y^p = Σ_{i=1}^{n} w_i q_i^p

of the output neuron for input pattern q^p yields the delta rule:

  Δw_j = η Σ_{p=1}^{N} (t^p − y^p) q_j^p.

  DeltaLearning(TrainingExamples, η)
    Initialize all weights w_j arbitrarily
    Repeat
      Δw = 0
      For all (q^p, t^p) ∈ TrainingExamples
        Compute the network output y^p = w · q^p
        Δw = Δw + η (t^p − y^p) q^p
      w = w + Δw
    Until w converges

The learning rate η determines the size of the weight changes.
The incremental delta rule
- So far, the weights are changed only after all training examples have been applied.
- Alternative: change the weights directly after each training example (a sketch follows below).

  DeltaLearningIncremental(TrainingExamples, η)
    Initialize all weights w_j arbitrarily
    Repeat
      For all (q^p, t^p) ∈ TrainingExamples
        Compute the network output y^p = w · q^p
        w = w + η (t^p − y^p) q^p
    Until w converges
Comparison to the perceptron
- With the perceptron, a classifier for linearly separable classes is learned by a threshold decision.
- The method of least squares and the delta rule generate a linear approximation of the data.
- From the learned linear approximation, a classifier can be generated by applying a threshold function.
- If incremental learning is not required, the method of least squares should be preferred (among other reasons because of its computing time).
The backpropagation algorithm
- Extension of the incremental delta rule
- with a sigmoid activation function
- and more than two layers of neurons.
- Known from the article(33) in the legendary PDP collection.(34)

(33) D.E. Rumelhart, G.E. Hinton and R.J. Williams. "Learning Internal Representations by Error Propagation". In [RM86], 1986.
(34) D. Rumelhart and J. McClelland. Parallel Distributed Processing. Vol. 1. MIT Press, 1986.
[Figure: feed-forward network with input layer, hidden layer (weights w^1_11, ..., w^1_{n1 n2}) and output layer (weights w^2_11, ..., w^2_{n2 n3}); the output x_j^p is compared with the target output t_j^p.]

Neuron model:

  x_j = f( Σ_{i=1}^{n} w_ji x_i )    (55)

with the sigmoid function

  f(x) = 1 / (1 + e^{−x}).
Similar to the incremental delta rule, the weights are changed for training pattern p according to the negative gradient of the summed quadratic error function over all output neurons

  E_p(w) = (1/2) Σ_{k ∈ output} (t_k^p − x_k^p)²,

that is

  Δ_p w_ji = −η ∂E_p/∂w_ji.

Repeated application of the chain rule (see(35) or(36)) provides the backpropagation learning rule (generalized delta rule)

  Δ_p w_ji = η δ_j^p x_i^p,

with

  δ_j^p = { x_j^p (1 − x_j^p)(t_j^p − x_j^p)        if j is an output neuron,
            x_j^p (1 − x_j^p) Σ_k δ_k^p w_kj        if j is a hidden neuron. }

[Figure: naming of neurons and weights (w_ji connects neuron i to neuron j) for the application of the backpropagation learning rule.]

(35) D.E. Rumelhart, G.E. Hinton and R.J. Williams. "Learning Internal Representations by Error Propagation". In [RM86], 1986.
(36) A. Zell. Simulation Neuronaler Netze. Addison Wesley, 1994. (The simulator described in the book.)
  BackPropagation(TrainingExamples, η)
    Initialize all weights arbitrarily
    Repeat
      For all (q^p, t^p) ∈ TrainingExamples
        1. Feed the request vector q^p to the input layer
        2. Forward propagation:
           For all layers, upwards, starting from the first hidden layer
             For all neurons of the layer
               Compute the activation x_j = f( Σ_{i=1}^{n} w_ji x_i )
        3. Compute the squared error E_p(w)
        4. Backward propagation:
           For all layers, downwards, starting from the last
             For all weights w_ji
               w_ji = w_ji + η δ_j^p x_i^p
    Until w converges or time limit reached

A runnable sketch follows below.
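A minimal NumPy implementation of the scheme for one hidden layer, trained here on XOR. The architecture (3 hidden neurons), learning rate, epoch count and the appended bias inputs (cf. the remarks on bias neurons below) are illustrative assumptions; with these settings the outputs typically converge to values near 0 and 1:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    rng = np.random.default_rng(0)
    # XOR data; a constant 1 is appended to each input as bias
    Q = np.array([[0,0,1],[0,1,1],[1,0,1],[1,1,1]], dtype=float)
    T = np.array([0.0, 1.0, 1.0, 0.0])

    W1 = rng.normal(size=(3, 3))   # input(+bias) -> 3 hidden neurons
    W2 = rng.normal(size=4)        # hidden(+bias) -> output neuron
    eta = 0.5

    for _ in range(20000):
        for q, t in zip(Q, T):
            h = sigmoid(q @ W1)                       # forward propagation, hidden layer
            hb = np.append(h, 1.0)                    # bias neuron in the hidden layer
            y = sigmoid(hb @ W2)                      # output
            d_out = y * (1 - y) * (t - y)             # delta of the output neuron
            d_hid = h * (1 - h) * d_out * W2[:3]      # deltas of the hidden neurons
            W2 += eta * d_out * hb                    # generalized delta rule
            W1 += eta * np.outer(q, d_hid)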
Remarks
- Non-linear mappings can be learned.
- Without a hidden layer, only linear mappings can be learned.
- Multi-layered networks with only linear activation functions can also learn only linear mappings.
- A variable sigmoid function

  f(x) = 1 / (1 + e^{−(x−Θ)})

  is used with threshold Θ.
- As with the perceptron, a bias neuron with constant activation 1 is added to each layer, with connections to all neurons of the next layer. The weights of these connections are trained normally; they represent the thresholds Θ of the successor neurons.
NETtalk: a network learns to speak

Sejnowski and Rosenberg, 1986(37): a system that learns to read aloud intelligibly.

(37) T.J. Sejnowski and C.R. Rosenberg. "NETtalk: a parallel network that learns to read aloud". Tech. Rep. JHU/EECS-86/01, The Johns Hopkins University, Electrical Engineering and Computer Science, 1986. Reprinted in [AR88], pp. 661-672.
[Figure: NETtalk architecture: an input window of 7 × 29 = 203 neurons sliding over the text (e.g. "the_father_is"), 80 hidden neurons, and 26 output neurons encoding phonemes and stress (labels such as accented, central, deep).]

Demo: NETtalk (http://www.cnl.salk.edu/ParallelNetsPronounce/index.php)
- The network has been trained with 1000 words.
- For every character, the stress has been specified as output.
- For converting the stress attributes to real sounds, a part of the speech synthesizer DECtalk has been used.
- The network consists of 203 · 80 + 80 · 26 = 18320 weights.
- On a VAX 780, 50 cycles over all words have been trained.
- With 5 characters per word: 5 · 50 · 1000 = 250 000 iterations of the backpropagation algorithm,
- 69 hours of computing time.
- Human-like properties: in the beginning only simple, unclear words; later a correctness of 95%.
Learning heuristics for theorem provers
- The semantic web requires automatic theorem provers for responding to search requests.
- For proving: a great number of possible inference steps.
- Combinatorial explosion of the search space.
- Heuristic search, e.g. the A*-algorithm.
- But where does the heuristic evaluation function come from?

Search tree of a prover:

[Figure: search tree with nodes marked as positive training data, negative training data, or unused.]

The training:
- evaluate multiple alternatives for the next step,
- select the alternative with the best score.
- Resolution: evaluation of the available clauses based on attributes such as the number of literals, the number of positive literals, the complexity of a term, etc.
- Positive and negative training data.
- On this data, a backpropagation network is trained.
- With this learned network, the prover is much faster (see(38),(39)).

(38) W. Ertel, J. Schumann and Ch. Suttner. "Learning Heuristics for a Theorem Prover using Back Propagation". In: 5. Österreichische Artificial-Intelligence-Tagung. Ed. by J. Retti and K. Leidlmair. Berlin, Heidelberg: Informatik-Fachberichte 208, Springer-Verlag, 1989, pp. 87-95.
(39) Ch. Suttner and W. Ertel. "Automatic Acquisition of Search Guiding Heuristics". In: 10th Int. Conf. on Automated Deduction. LNAI 449, Springer-Verlag, 1990, pp. 470-484.
Problems and improvements
- local minima of the error function
- convergence of backpropagation is often slow
- momentum term in the learning rule:

  Δ_p w_ji(t) = η δ_j^p x_i^p + γ Δ_p w_ji(t−1)

- minimizing a linear error function instead of the quadratic one (L1 norm)
- quadratic approximation of the error function (Fahlman): Quickprop
- adaptive step-size control (Riedmiller): RProp
Support vector machines

Advantages of linear neural networks:
- fast learning
- convergence guarantee
- low danger of overfitting

Advantage of non-linear networks (e.g. backpropagation):
- complex functions can be learned

Disadvantages of non-linear networks (e.g. backpropagation):
- local minima, convergence problems, overfitting

Solution: support vector machines (SVMs)
1. A non-linear transformation of the data such that the transformed data are linearly separable. This transformation is called a kernel.
2. In the transformed space, the support vectors are determined.

[Figure: two classes (+/−) in the plane, separated by a curved boundary.]

Linear separation of the classes: if the data are consistent, it is always possible to make them linearly separable by transforming the vector space of the classes,(40) e.g. by introducing a new dimension x_{n+1}:

  x_{n+1} = { 1 if x ∈ class 1,  0 if x ∈ class 0 }.

(40) A data point is inconsistent if it belongs to both classes.
[Figure: the same two classes in the plane (x_1, x_2), and after the transformation in three dimensions (x_1, x_2, x_3), where they become linearly separable.]

It's still not that easy. Why?
- SVMs are also applicable to regression problems.
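With a library implementation, the kernel trick is a one-liner. A sketch using scikit-learn (assuming it is installed); the XOR-like toy data and the parameter values are illustrative:

    import numpy as np
    from sklearn.svm import SVC

    # four points that are not linearly separable in the plane
    X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
    y = np.array([0, 0, 1, 1])

    clf = SVC(kernel='rbf', C=1.0, gamma=1.0)   # RBF kernel as the non-linear transformation
    clf.fit(X, y)
    print(clf.support_vectors_)                 # the support vectors found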
Literature:
- B. Schölkopf and A. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.
- E. Alpaydin. Introduction to Machine Learning. MIT Press, 2004.
- C.J. Burges. "A Tutorial on Support Vector Machines for Pattern Recognition". In: Data Mining and Knowledge Discovery 2.2 (1998), pp. 121-167. doi: 10.1023/A:1009715923555.
Applications of neural networks
- in all areas of industry
- all kinds of pattern recognition
- analysis of photographs for recognizing people or faces
- recognition of fish swarms in sonar echoes
- recognition and classification of military aircraft in radar scans
- recognition of speech and handwritten text
- robot control
- heuristic search control in backgammon or chess computers
- applications in reinforcement learning (Chapter 10)
- forecasting of stock prices
- evaluation of the credit-worthiness of bank customers
- ...
Summary and outlook
- perceptron, delta rule and backpropagation
- related to scores, naive Bayes and the method of least squares
- Hopfield networks are very instructive, but hard to handle in practice
- important in practice: associative memory models (Kohonen, Palm)

The following applies to all NN models:
- Information is stored distributed over many weights (holographically).
- The death of single neurons has almost no impact on the functioning of the brain.
- The networks are robust against small disturbances.
- Corrupted patterns can still be recognized.

Disadvantages:
- It is hard to localize information.
- In practice it is almost impossible to analyze or understand the weights trained in a network (⇒ decision trees).
- The "grandmother neuron" is hard to find.
- Combining NNs with symbol-processing systems is problematic.

Not examined:
- self-organizing maps (Kohonen)
- incremental learning
- sequential networks for learning time-dependent processes
- learning without a teacher (see Chapter 10)

Literature:
- A. Zell. Simulation Neuronaler Netze. Addison Wesley, 1994. The simulator SNNS (resp. JNNS) described in the book: www-ra.informatik.uni-tuebingen.de/SNNS.
- H. Ritter, T. Martinez and K. Schulten. Neuronale Netze. Addison Wesley, 1991.
- R. Rojas. Theorie der neuronalen Netze. Springer, 1993.

Collections of important original work:
- J. Anderson and E. Rosenfeld. Neurocomputing: Foundations of Research. Cambridge, MA: MIT Press, 1988.
- J. Anderson, A. Pellionisz and E. Rosenfeld. Neurocomputing (vol. 2): Directions for Research. Cambridge, MA: MIT Press, 1990.

Journals: Neural Networks; Neurocomputing.
Conference: NIPS (Neural Information Processing Systems).
Part X

Reinforcement Learning
Introduction
- Robotics: tasks are often complex
- and not directly programmable.
- Task: learning by trial and error which actions are good.
- In this way, humans learn, for example, how to walk:
- reward for forward movement,
- punishment for falling.
Negative reinforcement: Next time swing earlier? Ski slower? Learning by punishment.

Positive reinforcement: learning by reward.
The crawler

[Figure: the crawler robot: an arm with joint angles g_x, g_y attached to a body that moves along the x-axis (positions −5 to 5); a sequence of arm movements pushes the body forward.]
The walking robot

[Figure: the walking robot moving along the x-axis (positions −1 to 3), shown as a sequence of snapshots.]
The state space

[Figure: discretized state spaces of the walking robot with the actions left (←), right (→) and up/down (↑↓): 2 × 2 states (left), 4 × 4 states (middle), and an optimal strategy (right).]

State space: 2 × 2 (left), 4 × 4 (middle), optimal strategy (right).
The agent

[Figure: the agent observes the state s and the reward r from the environment and responds with an action a.]
The learning task

State s_t ∈ S; action a_t: s_t → s_{t+1}.

Transition function δ: s_{t+1} = δ(s_t, a_t).

Immediate reward r_t = r(s_t, a_t):
- r_t > 0: positive reward
- r_t = 0: no reward (often for a long time!)
- r_t < 0: negative reward (punishment)

Policy: π : S → A.

A policy is optimal if it maximizes the long-term reward.
A policy is optimal if it maximizes the long-term reward
Discounted reward Value function: π
2
V (st ) = rt + γrt+1 + γ rt+2 + . . . =
∞ X
γ i rt+i .
(56)
i=0
Alternative:
h
1X rt+i . V (st ) = lim h→∞ h i=0 π
A policy π ? is
optimal,
in case for all states s ?
V π (s) ≥ V π (s) Acronym: V ? = V π
?
(57)
(58)
Decision processes
- Markov decision process (MDP): the reward of an action depends only on the current state and the current action.
- POMDP (partially observable Markov decision process): the agent's state is not fully known.
Uninformed combinatorial search

[Figure: number of possible actions per state in small grid state spaces.]

  Grid        2 × 2      3 × 3            4 × 4             5 × 5
  # states    4          9                16                25
  # policies  2^4 = 16   2^4 · 3^4 · 4    2^4 · 3^8 · 4^4   2^4 · 3^12 · 4^9
                         = 5184           ≈ 2.7 · 10^7      ≈ 2.2 · 10^12

In general, an n_x × n_y grid has:
- 4 corner nodes with 2 possible actions,
- 2(n_x − 2) + 2(n_y − 2) side nodes with 3 possible actions,
- (n_x − 2)(n_y − 2) inner nodes with 4 possible actions,

thus 2^4 · 3^{2(n_x−2)+2(n_y−2)} · 4^{(n_x−2)(n_y−2)} possible policies.
Value of states

[Figure: two policies π1 and π2 for the walking robot, drawn as arrows in the state space, both starting at s0.]

Movement to the right is rewarded by 1, movement to the left is punished by −1.
Average speed: π1: 3/8 = 0.375, π2: 2/6 ≈ 0.333.

With V^π(s_t) = r_t + γ r_{t+1} + γ² r_{t+2} + ...:

  γ             0.9    0.8375   0.8
  V^{π1}(s0)    2.81   1.38     0.96
  V^{π2}(s0)    2.66   1.38     1.00

The greater γ, the greater the time horizon for the evaluation of policies!
Value iteration and dynamic programming

Dynamic programming, Richard Bellman, 1957 [Bel57]: Independently of the start state s_t and the first action a_t, all decisions in the possible successor states s_{t+1} must be optimal. This yields a globally optimal policy by local optimization!

Wanted: an optimal policy π* that satisfies

  V^π(s_t) = r_t + γ r_{t+1} + γ² r_{t+2} + ... = Σ_{i=0}^{∞} γ^i r_{t+i}
and
  V^{π*}(s) ≥ V^π(s).

Therefore

  V*(s_t) = max_{a_t, a_{t+1}, a_{t+2}, ...} ( r(s_t, a_t) + γ r(s_{t+1}, a_{t+1}) + γ² r(s_{t+2}, a_{t+2}) + ... ).

Since r(s_t, a_t) depends only on s_t and a_t:

  V*(s_t) = max_{a_t} [ r(s_t, a_t) + γ max_{a_{t+1}, a_{t+2}, ...} ( r(s_{t+1}, a_{t+1}) + γ r(s_{t+2}, a_{t+2}) + ... ) ]    (60)
          = max_{a_t} [ r(s_t, a_t) + γ V*(s_{t+1}) ].    (61)

Bellman equation (fixed-point equation):

  V*(s) = max_a [ r(s, a) + γ V*(δ(s, a)) ].    (62)

Thus

  π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ].    (63)

Iteration instruction (fixed-point iteration):

  V̂(s) = max_a [ r(s, a) + γ V̂(δ(s, a)) ],    (64)

with the initialization V̂(s) = 0 for all s.
  ValueIteration()
    For all s ∈ S
      V̂(s) = 0
    Repeat
      For all s ∈ S
        V̂(s) = max_a [ r(s, a) + γ V̂(δ(s, a)) ]
    Until V̂(s) converges

Theorem: Value iteration converges to V* [SB98]. (A sketch in Python follows below.)
[Figure: value iteration on the walking-robot state space: snapshots of the V̂-values from the all-zero initialization to the fixed point V*.]

Value iteration with γ = 0.9 and two optimal policies.
Attention: choosing an action that leads to the state with the highest V*-value is wrong. Why?
Application to the state s = (2,3) in V*:

  π*(2,3) = argmax_{a ∈ {left, right, up}} [ r(s, a) + γ V*(δ(s, a)) ]
          = argmax { 1 + 0.9 · 2.66,  −1 + 0.9 · 4.05,  0 + 0.9 · 3.28 }
          = argmax { 3.39, 2.65, 2.95 }
          = left
The hardware walking robot

Demo: walking robot
Unknown environment

What to do if the agent has no model of its actions, i.e. δ and r are unknown? Then the update

  V̂(s) = max_a [ r(s, a) + γ V̂(δ(s, a)) ]

is not applicable.

[Figure: the V*-values of the example, as above.]
Today's challenge: Go
- square board with 361 points
- 181 black, 180 white stones
- average branching factor: about 300
- after 4 half-moves: 8 · 10^9 positions
- classical game-tree search methods have no chance!
- pattern recognition on the board!
- Humans are still miles ahead of computer programs.

Breakthrough: a Go program beats a human Go champion
- no combinatorial search (such as minimax search)!
- pattern matching based on deep learning
- deep convolutional neural network (DCNN)
- the DCNN predicts the next k moves
- image processing on the 19 × 19 board
- Monte Carlo tree search (MCTS)
- up to 110,000 rollouts
- wins consistently against commercial Go engines
- Tian, Y. & Zhu, Y. Preprint at http://arxiv.org/pdf/1511.06410.pdf (2015).
- Elizabeth Gibney. "Google AI algorithm masters ancient game of Go". Nature 529, 445-446 (28 January 2016).
Q-learning

Evaluation function Q(s_t, a_t):

  π*(s) = argmax_a Q(s, a).    (65)

Discounting of future rewards and maximization of r_t + γ r_{t+1} + γ² r_{t+2} + ...

Evaluation of action a_t in state s_t:

  Q(s_t, a_t) = max_{a_{t+1}, a_{t+2}, ...} ( r(s_t, a_t) + γ r(s_{t+1}, a_{t+1}) + γ² r(s_{t+2}, a_{t+2}) + ... ).    (66)

Simplification:

  Q(s_t, a_t) = max_{a_{t+1}, a_{t+2}, ...} ( r(s_t, a_t) + γ r(s_{t+1}, a_{t+1}) + γ² r(s_{t+2}, a_{t+2}) + ... )    (67)
              = r(s_t, a_t) + γ max_{a_{t+1}, a_{t+2}, ...} ( r(s_{t+1}, a_{t+1}) + γ r(s_{t+2}, a_{t+2}) + ... )    (68)
              = r(s_t, a_t) + γ max_{a_{t+1}} ( r(s_{t+1}, a_{t+1}) + γ max_{a_{t+2}} ( r(s_{t+2}, a_{t+2}) + ... ) )    (69)
              = r(s_t, a_t) + γ max_{a_{t+1}} Q(s_{t+1}, a_{t+1})    (70)
              = r(s_t, a_t) + γ max_{a_{t+1}} Q(δ(s_t, a_t), a_{t+1})    (71)
              = r(s, a) + γ max_{a'} Q(δ(s, a), a').    (72)

This fixed-point equation is solved iteratively by:

  Q̂(s, a) = r(s, a) + γ max_{a'} Q̂(δ(s, a), a').    (73)
The algorithm for Q-learning:

  Q-Learning()
    For all s ∈ S, a ∈ A
      Q̂(s, a) = 0 (or randomly)
    Repeat
      Select (e.g. randomly) a state s
      Repeat
        Select an action a and carry it out
        Obtain reward r and new state s'
        Q̂(s, a) := r(s, a) + γ max_{a'} Q̂(s', a')
        s := s'
      Until s is a terminal state or time limit reached
    Until Q̂ converges

A sketch in Python follows below.
Application of the algorithm with n_x = 3, n_y = 2 and γ = 0.9:

[Figure: snapshots of the Q̂-values during Q-learning on the 3 × 2 state space, converging towards Q.]
Theorem: Let a deterministic MDP with bounded immediate reward r(s, a) be given, and let Equation (73) with 0 ≤ γ < 1 be used for learning. Let Q̂_n(s, a) be the value of Q̂(s, a) after n updates. If each state-action pair is visited infinitely often, then Q̂_n(s, a) converges to Q(s, a) for all values s and a as n → ∞.

Proof:
- All state-action transitions occur infinitely often.
- Consider intervals in which all state-action transitions occur at least once.
- The maximum error in the Q̂-table is reduced in each such interval by at least the factor γ:

Let

  Δ_n = max_{s,a} |Q̂_n(s, a) − Q(s, a)|

be the maximum error in the table Q̂_n, and let s' = δ(s, a). For all table entries Q̂_n(s, a):

  |Q̂_{n+1}(s, a) − Q(s, a)| = |(r + γ max_{a'} Q̂_n(s', a')) − (r + γ max_{a'} Q(s', a'))|
                            = γ |max_{a'} Q̂_n(s', a') − max_{a'} Q(s', a')|
                            ≤ γ max_{a'} |Q̂_n(s', a') − Q(s', a')|
                            ≤ γ max_{s'', a'} |Q̂_n(s'', a') − Q(s'', a')| = γ Δ_n.

The first inequality holds because for any functions f and g

  |max_x f(x) − max_x g(x)| ≤ max_x |f(x) − g(x)|,

and the second inequality holds because, by additionally varying the state s'', the resulting maximum cannot become smaller. Thus

  Δ_{n+1} ≤ γ Δ_n   and therefore   Δ_k ≤ γ^k Δ_0,

hence lim_{n→∞} Δ_n = 0.
Remarks
- According to the theorem above, Q-learning converges independently of the actions chosen during learning.
- The speed of convergence, however, depends on the actions chosen during learning.
Q-learning in non-deterministic environments

Non-deterministic environment: the response of the environment to action a in state s is non-deterministic:

  Q(s_t, a_t) = E(r(s, a)) + γ Σ_{s'} P(s'|s, a) max_{a'} Q(s', a').    (74)

- The convergence guarantee for Q-learning is lost!
- Reason: the same action a in the same state s can lead to totally different responses from the environment.

New learning rule:

  Q̂_n(s, a) = (1 − α_n) Q̂_{n−1}(s, a) + α_n [ r(s, a) + γ max_{a'} Q̂_{n−1}(δ(s, a), a') ]    (75)

with the time-varying weight factor

  α_n = 1 / (1 + b_n(s, a)),

where
- b_n(s, a) = number of times action a has been chosen in state s before iteration n,
- the term Q̂_{n−1}(s, a) stabilizes the update,
- the values b_n(s, a) must be memorized for all state-action pairs.
TD error and TD learning

With constant α_n = α:

  Q̂_n(s, a) = (1 − α) Q̂_{n−1}(s, a) + α [ r(s, a) + γ max_{a'} Q̂_{n−1}(δ(s, a), a') ]
            = Q̂_{n−1}(s, a) + α [ r(s, a) + γ max_{a'} Q̂_{n−1}(δ(s, a), a') − Q̂_{n−1}(s, a) ]

where the term in brackets is the TD error.

- α = 1: Q-learning
- α = 0: no learning takes place
- 0 < α < 1: ???
Exploration and exploitation

  Q-Learning()
    For all s ∈ S, a ∈ A
      Q̂(s, a) = 0 (or randomly)
    Repeat
      Choose (which?) a start state s
      Repeat
        Choose (which?) an action a and execute it
        Receive reward r and successor state s'
        Q̂(s, a) := r(s, a) + γ max_{a'} Q̂(s', a')
        s := s'
      Until s is a terminal state or time limit reached
    Until Q̂ converges

Possibilities for choosing the next action:
- Random choice: leads to a uniform exploration of all possible actions, but very slow convergence.
- Choosing the best action (highest Q̂-value): optimal exploitation of the learned policy and relatively fast convergence, but non-optimal policies can be learned.
Selection of the start state

[Figure: Q̂-values during learning, illustrating the influence of the selected start states.]
Function approximation, generalization and convergence
- Continuous state variables ⇒ infinite state space.
- A table with V- or Q-values can then not be stored explicitly.

Solution:
- The Q(s, a)-table is replaced by a neural network with input variables s, a and the Q-value as output.
- Finite representation of the (infinite) function Q(s, a)!
- Generalization (from finitely many training samples).

Attention: there is no convergence guarantee any more, because the convergence theorem above only holds if all state-action pairs are visited infinitely often.
Alternative: any other function approximator.
POMDP

POMDP: partially observable Markov decision process.
- Many different states are perceived as one and the same state:
- many states of the real world are mapped to one observation.
- Convergence problems with value iteration and Q-learning.
- Possible solutions: [SB98], observation-based learning [LR02].
Application: TD-Gammon
- TD learning (temporal difference learning) utilizes states that lie further in the future.
- TD-Gammon: a program for playing backgammon.
- TD learning with a backpropagation network with 40 to 80 hidden neurons.
- The only reward: the score at the end of the game.
- TD-Gammon was trained in 1.5 million games against itself.
- It beats backgammon grandmasters!
Other applications
- RoboCup: with reinforcement learning, a policy for the robot is learned, e.g. dribbling [SSK05; Rob].
- inverted pendulum
- control of a quadrocopter

Problems in robotics:
- extreme computation times for high-dimensional problems (many variables/actions)
- feedback from the environment on real robots is very slow
- better, faster learning algorithms are required
Landing of airplanes [Russ Tedrake, IROS 08]

Birds don't solve Navier-Stokes! [Russ Tedrake, IROS 08]
Curse of dimensionality
- Problem: high-dimensional state and action spaces.

Solution methods:
- Learning in nature happens on many abstraction layers.
- In computer science: every learned skill is encapsulated in a module.
- The action space is scaled down.
- States are abstracted.
- Hierarchical learning [BM03].
- Distributed learning (centipede: a brain for each leg).
Curse of dimensionality: other ideas
- The human brain is not a tabula rasa at birth.
- How to get a good initial policy for a robot?
  1. classical programming,
  2. reinforcement learning,
  3. a trainer offers additional feedback;
or:
  1. learning from demonstration (learning with a teacher),
  2. reinforcement learning [Bil+08],
  3. a trainer offers additional feedback.
Current state of research
- fitted value iteration
- connecting reinforcement learning with imitation learning
- policy gradient methods
- actor-critic methods
- natural gradient methods
Fitted value iteration

   1: Randomly sample m states from the MDP
   2: Ψ = 0
   3: n = the number of available actions in A
   4: repeat
   5:   for i = 1 → m do
   6:     for j = 1 → n do
   7:       q(a) = R(s^(i)) + γ V(s^(j))
   8:     end for
   9:     y^(i) = max_j q(a)
  10:   end for
  11:   Ψ = argmin_Ψ Σ_{i=1}^{m} (y^(i) − Ψ^T Φ(s^(i)))²
  12: until Ψ converges
www.teachingbox.org
- Reinforcement learning algorithms:
  - value iteration
  - Q(λ), SARSA(λ)
  - TD(λ)
  - tabular and function-approximation versions
  - actor-critic
  - tile coding
  - locally weighted regression
- Example environments:
  - mountain car
  - gridworld (with editor), windy gridworld
  - dice game
  - n-armed bandit
  - pole swing-up
Literature
- First introduction: T. Mitchell. Machine Learning. McGraw-Hill, 1997. www-2.cs.cmu.edu/~tom/mlbook.html.
- Standard work: R. Sutton and A. Barto. Reinforcement Learning. MIT Press, 1998. www.cs.ualberta.ca/~sutton/book/the-book.html.
- Overview: L.P. Kaelbling, M.L. Littman and A.W. Moore. "Reinforcement Learning: A Survey". In: Journal of Artificial Intelligence Research 4 (1996), pp. 237-285. www-2.cs.cmu.edu/afs/cs/project/jair/pub/volume4/kaelbling96a.pdf.