Slides for the book
Introduction to Artificial Intelligence, Wolfgang Ertel, Springer-Verlag, 2011
www.hs-weingarten.de/~ertel/kibuch   www.vieweg.de
April 10, 2017
Part I
Introduction: What is Artificial Intelligence (AI) · History of AI · Agents · Knowledge-Based Systems
What is Artificial Intelligence (AI)?

- What is intelligence?
- How can intelligence be measured?
- How does our brain work?
- Intelligent machines? Science fiction?
- Rebuild the human mind?
- Philosophy, e.g. mind-body dualism?
AI: Definition

John McCarthy (1955): The aim of AI is to develop machines that behave as if they were intelligent.
[Figure: two simple Braitenberg vehicles and their reaction to a light source.]
AI: Definition

Encyclopedia Britannica: AI is the ability of a digital computer or computer-controlled robot to perform tasks commonly associated with intelligent beings.

According to this definition, every computer is an AI system.
AI: Definition

Elaine Rich (1): Artificial Intelligence is the study of how to make computers do things at which, at the moment, people are better.

- Still up to date in the year 2050!
- Humans are still better in many fields (e.g. understanding pictures, learning)!
- Computers are already better in many fields (e.g. playing chess)!

1 E. Rich. Artificial Intelligence. McGraw-Hill, 1983.
Brain Research and Problem Solving

Different approaches:
- How does the human brain work?
- Problem oriented: building intelligent agents!
- General store!

[Figure: a small excerpt of the offered range of AI methods.]
The Turing Test and Chatterbots

Alan Turing: The machine passes the test if it can mislead Alice (the human interrogator) in 30% of the cases.

Joseph Weizenbaum (computer critic): the program Eliza talks to his secretary.
Demo: Cleverbot · Demo: Simonlaven · Demo: Alicebot
History of AI I

1931  The Austrian Kurt Gödel shows that in first-order predicate logic all true statements are derivable. In higher-order logics, on the other hand, there are true statements that are unprovable.
1937
Alan Turing points out the limits of intelligent machines with the halting problem.
1943
McCulloch and Pitts model neural networks and make the connection to propositional logic.
1950
Alan Turing defines machine intelligence with the Turing test and writes about learning machines and genetic algorithms.
1951
Marvin Minsky develops a neural network machine. With 3000 vacuum tubes he simulates 40 neurons.
History of AI II 1955
Arthur Samuel (IBM) builds a learning checkers program that plays better than its developer.
1956
McCarthy organizes a conference at Dartmouth College. Here the name Artificial Intelligence is first introduced. Newell and Simon of Carnegie Mellon University (CMU) present the Logic Theorist, the first symbol-processing computer program.
1958
McCarthy invents the high-level language LISP at MIT (Massachusetts Institute of Technology). He writes programs that are capable of modifying themselves.
1959
Gelernter (IBM) builds the Geometry Theorem Prover.
1961
The General Problem Solver (GPS) by Newell and Simon imitates human thought.
History of AI III 1963
McCarthy founds the AI Lab at Stanford University.
1965
Robinson invents the resolution calculus for predicate logic (Sec. ).
1966
Weizenbaum's program Eliza carries out dialogue with people in natural language (Sec. ).
1969
Minsky and Papert show in their book Perceptrons that the perceptron, a very simple neural network, can only represent linear functions (Sec. ).
1972
French scientist Alain Colmerauer invents the logic programming language PROLOG (Sec. ).
British physician de Dombal develops an expert system for diagnosis of acute abdominal pain. It goes unnoticed in the mainstream AI community of the time (Sec. ).
History of AI IV 1976
Shortliffe and Buchanan develop MYCIN, an expert system for diagnosis of infectious diseases, which is capable of dealing with uncertainty (Ch. ).
1981
Japan begins, at great expense, the Fifth Generation Project with the goal of building a powerful PROLOG machine.
1982
R1, the expert system for configuring computers, saves Digital Equipment Corporation 40 million dollars per year.
1986
Renaissance of neural networks through, among others, Rumelhart, Hinton and Sejnowski. The system NETtalk learns to read texts aloud (Sec. ).
1990
Pearl, Cheeseman, Whittaker, and Spiegelhalter bring probability theory into AI with Bayesian networks (Sec. ). Multi-agent systems become popular.
History of AI V 1992
Tesauro's TD-Gammon program demonstrates the advantages of reinforcement learning.
1993
Worldwide RoboCup initiative to build soccer-playing autonomous robots.
1995
From statistical learning theory, Vapnik develops support vector machines, which are very important today.
1997
First international RoboCup competition in Japan. IBM's Deep Blue beats chess world champion Garry Kasparov by a score of 3.5 to 2.5.
2003
The robots in RoboCup demonstrate impressively what AI and robotics are capable of achieving.
2006
Service robotics becomes a major AI research area.
History of AI VI 2009
First Google self-driving car on a freeway in California.
2010
Autonomous robots start learning their policies.
2011
The IBM software Watson beats two human champions on the US TV show Jeopardy! Watson understands natural language and answers very quickly (Sec. ).
2015
Daimler presents the first autonomous truck on a German highway. Google self-driving cars have logged 1.6 million kilometers on public roads and in cities. Deep learning (Sec. ) achieves very good image classification performance. With deep learning, paintings in the style of famous painters such as Picasso can be generated automatically. AI goes creative!
History of AI VII
2016
A Go program beats the European champion 5:0, based on deep learning for pattern recognition, reinforcement learning, and Monte Carlo tree search.
Phases of AI history

- The first beginnings
- Logic solves all problems
- The new connectionism
- Reasoning with uncertainty
- Distributed, autonomous and learning agents
- AI has grown up
[Figure: timeline of AI history from 1930 to 2010, ordered by power of representation. Symbolic branch: Gödel, Turing, Dartmouth conference, first-order logic, resolution, LISP, GPS, PROLOG, automated theorem provers (PTTP, Otter, SETHEO, E-prover), heuristic search, Jaynes' probabilistic reasoning, Bayesian nets, decision tree learning (Hunt; ID3, CART; C4.5), Zadeh's fuzzy logic, propositional logic, Davis/Putnam, hybrid systems. Numeric branch: neural networks, neuro-hardware, Minsky/Papert book, back-propagation, support vector machines, deep belief networks, deep learning.]
AI and Society

or: The purpose of AI
- AI saves energy
- AI saves time
- AI saves money
- AI increases productivity
- AI kills jobs!

Schwäbische Zeitung, Jan. 19, 2016
AI and Society

or: The purpose of AI
- AI saves energy
- AI saves time
- AI saves money
- AI increases productivity
- AI kills jobs!
- Economic growth creates new jobs!

The Economy must Grow!

Who earns the profits?
- Capitalists!
- Banks!
AI and Society

or: The purpose of AI
- AI saves energy
- AI saves time
- AI saves money
- AI increases productivity
- Robots do the work!
- We humans work less!
Stephen Hawking on reddit.com:

Professor Hawking, whenever I teach AI, Machine Learning, or Intelligent Robotics, my class and I end up having what I call "The Terminator Conversation". My point in this conversation is that the dangers from AI are overblown by media and non-understanding news, and the real danger is the same danger in any complex, less-than-fully-understood code: edge case unpredictability. In my opinion, this is different from dangerous AI as most people perceive it, in that the software has no motives, no sentience, and no evil morality, and is merely (ruthlessly) trying to optimize a function that we ourselves wrote and designed. Your viewpoints (and Elon Musk's) are often presented by the media as a belief in evil AI, though of course that's not what your signed letter says. Students that are aware of these reports challenge my view, and we always end up having a pretty enjoyable conversation. How would you represent your own beliefs to my class? Are our viewpoints reconcilable? Do you think my habit of discounting the layperson Terminator-style evil AI is naive? And finally, what morals do you think I should be reinforcing to my students interested in AI?
Answer: You're right: media often misrepresent what is actually said. The real risk with AI isn't malice but competence. A superintelligent AI will be extremely good at accomplishing its goals, and if those goals aren't aligned with ours, we're in trouble. You're probably not an evil ant-hater who steps on ants out of malice, but if you're in charge of a hydroelectric green energy project and there's an anthill in the region to be flooded, too bad for the ants. Let's not place humanity in the position of those ants.

Please encourage your students to think not only about how to create AI, but also about how to ensure its beneficial use.
Stephen Hawking on reddit.com:

If machines produce everything we need, the outcome will depend on how things are distributed. Everyone can enjoy a life of luxurious leisure if the machine-produced wealth is shared, or most people can end up miserably poor if the machine-owners successfully lobby against wealth redistribution. So far, the trend seems to be toward the second option, with technology driving ever-increasing inequality.
Why must the Economy grow?

- The amount of money grows too fast!
- Reason #1: money creation by private banks
- Reason #2: the interest system

The solution

- Public money (Joseph Huber: Vollgeld; www.monetative.de)
- Natural economic order (Margrit Kennedy: Occupy Money)
- Tax reform: remove the income tax, introduce an energy tax (Reiner Kümmel: The Second Law of Economics)
Agents

Software agent

[Figure: a software agent receives input from a user and returns output to the user.]

Hardware agent (autonomous robot)

[Figure: a hardware agent consists of sensors 1…n feeding perception into a software agent, which drives actuators 1…m that manipulate the environment.]
- Reflex agent: a function from the set of all inputs to the set of all outputs.
- Markov decision process: only the current state is needed to determine the optimal action.
- Agent with memory: is not a function. Why?
- Agent capable of learning
- Distributed agents
- Goal-oriented agent
Example: Spam filter

A spam filter aims at assigning emails to their correct classes.

Agent 1:
                          correct class
  spam filter decides     desired   SPAM
  desired                 189       1
  SPAM                    11        799

Agent 2:
                          correct class
  spam filter decides     desired   SPAM
  desired                 200       38
  SPAM                    0         762

Which agent is better?
Cost-Oriented Agent

Definition: The goal of a cost-oriented agent is to minimize the long-term cost (i.e. the average cost) caused by wrong decisions. The sum of all weighted errors gives the total cost.

Example: the appendicitis diagnosis system LEXMED (see Sec. ).
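To make the definition concrete, here is a minimal Python sketch that computes the weighted total cost for the two spam-filter agents above. The confusion-matrix counts come from the example; the two error weights are illustrative assumptions, not values from the slides.

# (decided, actual) -> count, taken from the spam-filter example
agent1 = {("desired", "desired"): 189, ("desired", "spam"): 1,
          ("spam", "desired"): 11, ("spam", "spam"): 799}
agent2 = {("desired", "desired"): 200, ("desired", "spam"): 38,
          ("spam", "desired"): 0, ("spam", "spam"): 762}

# Assumed weights: deleting a desired mail is far worse than
# letting one spam mail through.
cost = {("spam", "desired"): 10.0,   # desired mail classified as spam
        ("desired", "spam"): 1.0}    # spam mail classified as desired

def total_cost(agent):
    return sum(cost.get(key, 0.0) * n for key, n in agent.items())

print(total_cost(agent1))  # 10*11 + 1*1  = 111.0
print(total_cost(agent2))  # 10*0  + 1*38 =  38.0

Under these (assumed) weights, agent 2 is the better one, although it lets more spam through: it never deletes a desired email.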
Environment

- observable (chess computer)
- partially observable (robot)
- deterministic (8-puzzle)
- nondeterministic (chess computer, robot)
- discrete (chess computer)
- continuous (robot)
Knowledge-Based Systems

Strict separation of:
- Knowledge: the knowledge base (KB)
- Inference mechanism

[Figure: structure of a knowledge-based system. Knowledge sources, data, and the environment feed knowledge acquisition, either via knowledge engineering (knowledge engineer and expert) or via machine learning; the knowledge base KB together with a database is used by knowledge processing (inference), which answers the user's queries.]
Knowledge Engineering

How to get the knowledge into the AI system?
1. Knowledge engineer
2. Machine learning
Separation of knowledge and inference

Advantages:
- Inference is application-independent (e.g. medical expert system).
- Knowledge can be stored declaratively.
Knowledge Representation with formal languages

- Propositional logic
- First-order logic (short: FOL)
- Probabilistic logic
- Fuzzy logic
- Decision trees
Part II
Propositional Logic: Syntax · Semantics · Proof Systems · Resolution · Horn Clauses · Computability and Complexity
If it is raining, the street is wet.

Written more formally:

it is raining  ⇒  the street is wet
Syntax

Definition: Let Op = {¬, ∧, ∨, ⇒, ⇔, (, )} be the set of logical operators and Σ a set of symbols. The sets Op, Σ and {t, f} are pairwise disjoint. Σ is called the signature and its elements are the proposition variables. The set of propositional logic formulas is defined recursively:

- t and f are (atomic) formulas.
- All proposition variables, that is, all elements of Σ, are (atomic) formulas.
- If A and B are formulas, then ¬A, (A), A ∧ B, A ∨ B, A ⇒ B, A ⇔ B are also formulas.
Definition: We read the symbols and operators in the following way:

t      :  true
f      :  false
¬A     :  not A                 (negation)
A ∧ B  :  A and B               (conjunction)
A ∨ B  :  A or B                (disjunction)
A ⇒ B  :  if A then B           (implication)
A ⇔ B  :  A if and only if B    (equivalence)
With Σ = {A, B, C}, for example,

A ∧ B,   A ∧ B ∧ C,   A ∧ A ∧ A,   C ∧ B ∨ A,   (¬A ∧ B) ⇒ (¬C ∨ A)

are formulas. (((A)) ∨ B) is also a syntactically correct formula. The formulas defined in this way are so far purely syntactic constructions without meaning. We are still missing the semantics.
Semantics

Is the formula A ∧ B true?

Definition: A mapping I : Σ → {t, f}, which assigns a truth value to every proposition variable, is called an assignment or interpretation or also a world. Every propositional logic formula with n different variables has 2^n different interpretations.
Truth table

A B | (A)  ¬A  A∧B  A∨B  A⇒B  A⇔B
t t |  t    f   t    t    t    t
t f |  t    f   f    t    f    f
f t |  f    t   f    t    t    f
f f |  f    t   f    f    t    t

The empty formula is true for all interpretations. Operator priorities (descending binding strength): ¬, ∧, ∨, ⇒, ⇔.
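The truth table method can be sketched in a few lines of Python. This is an illustrative helper, not code from the book; a formula is passed as a Python function over an interpretation.

from itertools import product

def truth_table(variables, formula):
    # formula: a function mapping an interpretation (dict) to True/False
    for values in product([True, False], repeat=len(variables)):
        interp = dict(zip(variables, values))
        row = " ".join("t" if interp[v] else "f" for v in variables)
        print(row, "|", "t" if formula(interp) else "f")

# Example: A => B, encoded as (not A) or B
truth_table(["A", "B"], lambda i: (not i["A"]) or i["B"])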
Definition: Two formulas F and G are called semantically equivalent if they take on the same truth value for all interpretations. We write F ≡ G.

Meta language: natural language, e.g. A ≡ B. Object language: logic, e.g. A ⇔ B.
Theorem: The operations ∧, ∨ are commutative and associative, and the following equivalences are generally valid:

¬A ∨ B             ⇔  A ⇒ B              (implication)
A ⇒ B              ⇔  ¬B ⇒ ¬A            (contraposition)
(A ⇒ B) ∧ (B ⇒ A)  ⇔  (A ⇔ B)            (equivalence)
¬(A ∧ B)           ⇔  ¬A ∨ ¬B            (De Morgan's law)
¬(A ∨ B)           ⇔  ¬A ∧ ¬B            (De Morgan's law)
(A ∨ B) ∧ (A ∨ C)  ⇔  A ∨ (B ∧ C)        (distributive law)
A ∧ (B ∨ C)        ⇔  (A ∧ B) ∨ (A ∧ C)  (distributive law)
A ∨ ¬A             ⇔  t                  (tautology)
A ∧ ¬A             ⇔  f                  (contradiction)
A ∨ f              ⇔  A
A ∨ t              ⇔  t
A ∧ f              ⇔  f
A ∧ t              ⇔  A
Proof: only the first equivalence:

A B | ¬A  ¬A∨B  A⇒B  (¬A∨B) ⇔ (A⇒B)
t t |  f   t     t          t
t f |  f   f     f          t
f t |  t   t     t          t
f f |  t   t     t          t

The proofs for the other equivalences are similar and are recommended as exercises for the reader (Exercise ??).  □
Variants of truth

According to the number of interpretations in which a formula is true, we can divide formulas into the following classes:

Definition: A formula is called
- satisfiable if it is true for at least one interpretation;
- logically valid, or simply valid, if it is true for all interpretations — valid formulas are also called tautologies;
- unsatisfiable if it is not true for any interpretation.

Every interpretation that satisfies a formula is called a model of the formula.

Clearly, the negation of every valid formula is unsatisfiable. The negation of a satisfiable, but not valid, formula F is satisfiable.
Does a formula Q follow from the knowledge base WB?

Definition: A formula WB entails a formula Q (or Q follows from WB) if every model of WB is also a model of Q. We write WB |= Q.

- This is a semantic concept!
- Every formula selects a subset of the set of all interpretations as its set of models.
- Tautologies such as A ∨ ¬A do not restrict the set of satisfying interpretations, because their proposition is empty.
- The empty formula is therefore true in all interpretations.
- For every tautology T: ∅ |= T.
Theorem (deduction theorem): A |= B if and only if |= A ⇒ B.

Proof sketch: consider the truth table for implication:

A B | A ⇒ B
t t |   t
t f |   f
f t |   t
f f |   t

An arbitrary implication A ⇒ B is clearly always true except with the interpretation A ↦ t, B ↦ f. Assume that A |= B holds. This means that for every interpretation that makes A true, B is also true. The critical second row of the truth table then never applies. Therefore A ⇒ B is true under all interpretations, which means that A ⇒ B is a tautology. Thus one direction of the statement has been shown.

- The truth table method is a proof system for propositional logic!
- Disadvantage: very long computing time in the worst case (2^n interpretations).
If a formula WB entails a formula Q, then by the deduction theorem WB ⇒ Q is a tautology. Therefore the negation ¬(WB ⇒ Q) is unsatisfiable. We have

¬(WB ⇒ Q) ≡ ¬(¬WB ∨ Q) ≡ WB ∧ ¬Q.

Therefore WB ∧ ¬Q is also unsatisfiable. Thus:

Theorem (proof by contradiction): WB |= Q if and only if WB ∧ ¬Q is unsatisfiable.

To show that the query Q follows from the knowledge base WB, we can therefore add the negated query ¬Q to the knowledge base and derive a contradiction.
Fields of application:
- in mathematics
- in many automatic proof calculi, among others resolution and PROLOG

Derivation: syntactic manipulation of the formulas WB and Q by application of inference rules, with the goal of simplifying them so greatly that in the end we can instantly see that WB |= Q.

Calculus: a syntactic proof system (a system of derivation rules).
Syntactic derivation and semantic entailment

[Diagram: on the syntactic level (formulas), WB ⊢ Q via derivation; on the semantic level (interpretations), Mod(WB) |= Mod(Q) via entailment; the interpretation maps each formula to its set of models.]

To keep automatic proof systems as simple as possible, they are usually made to operate on formulas in conjunctive normal form.
Definition: A formula is in conjunctive normal form (CNF) if and only if it consists of a conjunction

K1 ∧ K2 ∧ … ∧ Km

of clauses. A clause Ki consists of a disjunction

(Li1 ∨ Li2 ∨ … ∨ Lini)

of literals. Finally, a literal is a variable (positive literal) or a negated variable (negative literal).

The formula (A ∨ B ∨ ¬C) ∧ (A ∨ B) ∧ (¬B ∨ ¬C) is in conjunctive normal form. The conjunctive normal form does not place a restriction on the set of formulas, because:

Theorem: Every propositional logic formula can be transformed into an equivalent formula in conjunctive normal form.
Example: Proof calculus modus ponens:

A,  A ⇒ B
—————————
    B

Modus ponens is sound, but not complete.

Resolution rule (special case):

A ∨ B,  ¬B ∨ C             A ∨ B,  B ⇒ C
———————————————    resp.   ———————————————        (1)
     A ∨ C                      A ∨ C

- The derived clause is called the resolvent.
- The resolution rule is a generalization of modus ponens.
General resolution rule:

(A1 ∨ … ∨ Am ∨ B),  (¬B ∨ C1 ∨ … ∨ Cn)
———————————————————————————————————————        (2)
(A1 ∨ … ∨ Am ∨ C1 ∨ … ∨ Cn)

We call the literals B and ¬B complementary.

Theorem: The resolution calculus for the proof of unsatisfiability of formulas in conjunctive normal form is sound and complete.

The knowledge base WB must be consistent! Otherwise anything can be derived from WB (see Exercise ??).
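As an illustration of the resolution calculus, here is a minimal propositional resolution prover in Python (a sketch, not the book's implementation). Clauses are sets of literals; a literal is negated with a "-" prefix. The prover saturates the clause set and reports unsatisfiability when the empty clause appears. The test clauses are those of the high-jump puzzle that appears later in this section.

from itertools import combinations

def negate(lit):
    return lit[1:] if lit.startswith("-") else "-" + lit

def resolvents(c1, c2):
    # all clauses derivable from c1, c2 by one resolution step
    out = []
    for lit in c1:
        if negate(lit) in c2:
            out.append((c1 - {lit}) | (c2 - {negate(lit)}))
    return out

def unsatisfiable(clauses):
    # saturate the clause set; True iff the empty clause is derived
    clauses = set(map(frozenset, clauses))
    while True:
        new = set()
        for c1, c2 in combinations(clauses, 2):
            for r in resolvents(c1, c2):
                if not r:
                    return True        # empty clause: contradiction found
                new.add(frozenset(r))
        if new <= clauses:
            return False               # saturated without contradiction
        clauses |= new

# The six clauses of the high-jump puzzle are unsatisfiable:
print(unsatisfiable([{"-A", "-B"}, {"A", "B"}, {"-B", "-C"},
                     {"B", "C"}, {"-C", "-A"}, {"C", "A"}]))  # True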
Example: Despite studying English for seven long years with brilliant success, I must admit that when I hear English people speaking English I'm totally perplexed. Recently, moved by noble feelings, I picked up three hitchhikers, a father, mother, and daughter, who I quickly realized were English and only spoke English. At each of the sentences that follow I wavered between two possible interpretations. They told me the following (the second possible meaning is in parentheses):

The father: We are going to Spain (we are from Newcastle).
The mother: We are not going to Spain and are from Newcastle (we stopped in Paris and are not going to Spain).
The daughter: We are not from Newcastle (we stopped in Paris).

What about this charming English family?
Three steps:
- Formalization
- Transformation into normal form
- Proof (often very difficult); practising is very important! (see Exercises ??)
(S ∨ N) ∧ [(¬S ∧ N) ∨ (P ∧ ¬S)] ∧ (¬N ∨ P)

Factoring out ¬S in the middle sub-formula brings the formula into CNF in one step:

WB ≡ (S ∨ N)1 ∧ (¬S)2 ∧ (P ∨ N)3 ∧ (¬N ∨ P)4.

Now we begin the resolution proof (at first without a query Q):

Res(1,2): (N)5
Res(3,4): (P)6
Res(1,4): (S ∨ P)7

The empty clause is not derivable, thus WB is non-contradictory.

To show that ¬S holds, we add the clause (S)8 to the set of clauses as the negated query:

Res(2,8): ()9

Thus ¬S ∧ N ∧ P holds. The charming English family evidently comes from Newcastle, stopped in Paris, but is not going to Spain.
Logic puzzle number 28 from [Ber89], The High Jump, reads:

Three girls practice high jump for their physical education final exam. The bar is set to 1.20 meters. "I bet," says the first girl to the second, "that I will make it over if, and only if, you don't." If the second girl said the same to the third, who in turn said the same to the first, would it be possible for all three to win their bets?
We show by a resolution proof that not all three can win their bets.

Formalization: The first girl's jump succeeds: A; the second girl's jump succeeds: B; the third girl's jump succeeds: C.

First girl's bet: (A ⇔ ¬B); second girl's bet: (B ⇔ ¬C); third girl's bet: (C ⇔ ¬A).

Claim: the three cannot all win their bets:

Q ≡ ¬((A ⇔ ¬B) ∧ (B ⇔ ¬C) ∧ (C ⇔ ¬A))

It must now be shown by resolution that ¬Q is unsatisfiable.

Transformation into CNF: the first girl's bet becomes

(A ⇔ ¬B) ≡ (A ⇒ ¬B) ∧ (¬B ⇒ A) ≡ (¬A ∨ ¬B) ∧ (A ∨ B).

The bets of the other two girls undergo analogous transformations, and we obtain the negated claim

¬Q ≡ (¬A ∨ ¬B)1 ∧ (A ∨ B)2 ∧ (¬B ∨ ¬C)3 ∧ (B ∨ C)4 ∧ (¬C ∨ ¬A)5 ∧ (C ∨ A)6.

From there we derive the empty clause using resolution:

Res(1,6): (C ∨ ¬B)7
Res(4,7): (C)8
Res(2,5): (B ∨ ¬C)9
Res(3,9): (¬C)10
Res(8,10): ()

Thus the claim has been proved.
Horn clauses

A clause in conjunctive normal form contains positive and negative literals and can be represented in the form

(¬A1 ∨ … ∨ ¬Am ∨ B1 ∨ … ∨ Bn).

This clause can be transformed into the equivalent form

A1 ∧ … ∧ Am ⇒ B1 ∨ … ∨ Bn.
Example: If the weather is nice and there is snow on the ground, I will go skiing or I will work. (non-definite clause)
If the weather is nice and there is snow on the ground, I will go skiing. (definite clause)
Definition: Clauses with at most one positive literal, of the form

(¬A1 ∨ … ∨ ¬Am ∨ B)   or   (¬A1 ∨ … ∨ ¬Am)   or   B,

or (equivalently)

A1 ∧ … ∧ Am ⇒ B   or   A1 ∧ … ∧ Am ⇒ f   or   B,

are named Horn clauses (after their inventor). A clause with a single positive literal is called a fact. In clauses with negative literals and one positive literal, the positive literal is called the head.

To better understand the representation of Horn clauses, the reader may derive them from the definitions of the equivalences we have been using (Exercise ??).
Example: Knowledge base:

(nice_weather)1
(snowfall)2
(snowfall ⇒ snow)3
(nice_weather ∧ snow ⇒ skiing)4

Does skiing hold? Inference rule (generalized modus ponens):

A1 ∧ … ∧ Am,  A1 ∧ … ∧ Am ⇒ B
——————————————————————————————
              B

Proof for skiing:

MP(2,3): (snow)5
MP(1,5,4): (skiing)6

Modus ponens is complete for propositional Horn clauses.

Forward chaining: starts with the facts and finally derives the query.
Backward chaining: starts with the query and works backwards until the facts are reached.
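Forward chaining for propositional Horn clauses follows directly from the generalized modus ponens. The following illustrative Python sketch (not from the slides) reproduces the ski example.

def forward_chain(facts, rules, query):
    # rules: list of (premises, conclusion); True iff query is derivable
    known = set(facts)
    changed = True
    while changed and query not in known:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and all(p in known for p in premises):
                known.add(conclusion)   # one application of modus ponens
                changed = True
    return query in known

facts = ["nice_weather", "snowfall"]
rules = [(["snowfall"], "snow"),
         (["nice_weather", "snow"], "skiing")]
print(forward_chain(facts, rules, "skiing"))  # True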
SLD resolution: Selection rule driven linear resolution for definite clauses.
Example, augmented by the negated query (skiing ⇒ f):

(nice_weather)1
(snowfall)2
(snowfall ⇒ snow)3
(nice_weather ∧ snow ⇒ skiing)4
(skiing ⇒ f)5

SLD resolution:

Res(5,4): (nice_weather ∧ snow ⇒ f)6
Res(6,1): (snow ⇒ f)7
Res(7,3): (snowfall ⇒ f)8
Res(8,2): ()
- Linear resolution: further processing is always done on the currently derived clause. This reduces the search space.
- Selection rule driven: the literals of the current clause are always processed in a fixed order (for example, from right to left).
- The literals of the current clause are called subgoals. The literals of the negated query are the goals.
Inference rule:

A1 ∧ … ∧ Am ⇒ B1,  B1 ∧ B2 ∧ … ∧ Bn ⇒ f
—————————————————————————————————————————
A1 ∧ … ∧ Am ∧ B2 ∧ … ∧ Bn ⇒ f
SLD resolution and PROLOG

- The proof (contradiction) is found if the list of subgoals of the current clause (the so-called goal stack) is empty.
- If, for a subgoal ¬Bi, there is no clause with the complementary literal Bi as its clause head, the proof terminates and no contradiction can be found.
- PROLOG programs consist of predicate logic Horn clauses.
- Their processing is achieved by means of SLD resolution.
Computability and Complexity

- The truth table method determines every model of any formula in finite time.
- The sets of unsatisfiable, satisfiable, and valid formulas are decidable.
- The computation time of the truth table method grows exponentially: T(n) = O(2^n).
- Optimization: the semantic tree; it also grows exponentially in the worst case.
- In resolution, the number of derived clauses grows exponentially with the number of initial clauses in the worst case.
- S. Cook: the 3-SAT problem is NP-complete. 3-SAT is the set of all satisfiable CNF formulas whose clauses have exactly three literals.
- For Horn clauses, the computation time for testing satisfiability grows only linearly with the number of literals in the formula.
Applications and Limitations

- Digital technology
- Verification of digital circuits
- Generation of test patterns
- Simple AI applications: discrete variables, few values, no relations between variables
- Probabilistic logic (Ch. ) uses propositional logic and models uncertainty
- Fuzzy logic allows an infinite number of truth values
Part III

First-Order Predicate Logic: Syntax · Semantics · Quantifiers and Normal Forms · Proof Calculi · Resolution · Automated Theorem Provers
Statement: robot 7 is situated at the xy position (35,79).

Propositional logic variable: Robot_7_is_situated_at_xy_position_(35,79)

100 robots on a grid of 100 × 100 points ⇒ 100 · 100 · 100 = 1 000 000 = 10^6 variables.
Relation: robot A is to the right of robot B.

(100 · 99)/2 = 4950 ordered pairs of x values. About 10^4 formulas of the type

Robot_7_is_to_the_right_of_robot_12 ⇔
  (Robot_7_is_situated_at_xy_position_(35,79) ∧ Robot_12_is_situated_at_xy_position_(10,93)) ∨ …

with (10^4)^2 · 0.495 = 0.495 · 10^8 alternatives on the right side.
First-order predicate logic: position(number, xPosition, yPosition)

∀u ∀v  is_further_right(u, v) ⇔
  ∃xu ∃yu ∃xv ∃yv  position(u, xu, yu) ∧ position(v, xv, yv) ∧ xu > xv
Syntax

Terms, e.g.: f(sin(ln(3)), exp(x))

Definition: Let V be a set of variables, K a set of constants, and F a set of function symbols. The sets V, K and F are pairwise disjoint. We define the set of terms recursively:

- All variables and constants are (atomic) terms.
- If t1, …, tn are terms and f an n-place function symbol, then f(t1, …, tn) is also a term.
Definition: Let P be a set of predicate symbols. Predicate logic formulas are built as follows:

- If t1, …, tn are terms and p an n-place predicate symbol, then p(t1, …, tn) is an (atomic) formula.
- If A and B are formulas, then ¬A, (A), A ∧ B, A ∨ B, A ⇒ B, A ⇔ B are also formulas.
- If x is a variable and A a formula, then ∀x A and ∃x A are also formulas. ∀ is the universal quantifier and ∃ the existential quantifier.
- p(t1, …, tn) and ¬p(t1, …, tn) are called literals.
- Formulas in which every variable is in the scope of a quantifier are called first-order sentences or closed formulas. Variables which are not in the scope of a quantifier are called free variables.
- Definitions ?? (CNF) and ?? (Horn clauses) hold for formulas of predicate logic literals analogously.
Examples:

Formula                                          Description
∀x frog(x) ⇒ green(x)                            All frogs are green
∀x frog(x) ∧ brown(x) ⇒ big(x)                   All brown frogs are big
∀x likes(x, cake)                                Everyone likes cake
¬∀x likes(x, cake)                               Not everyone likes cake
¬∃x likes(x, cake)                               No one likes cake
∃x ∀y likes(y, x)                                There is something that everyone likes
∃x ∀y likes(x, y)                                There is someone who likes everything
∀x ∃y likes(y, x)                                Everything is loved by someone
∀x ∃y likes(x, y)                                Everyone likes something
∀x customer(x) ⇒ likes(bob, x)                   Bob likes every customer
∃x customer(x) ∧ likes(x, bob)                   There is a customer whom Bob likes
∃x baker(x) ∧ (∀y customer(y) ⇒ likes(x, y))     There is a baker who likes all of his customers
Semantics

Definition: An assignment or interpretation B is defined as

- a mapping from the set of constants and variables K ∪ V to a set W of names of objects in the world;
- a mapping from the set of function symbols to the set of functions in the world; every n-place function symbol is assigned an n-place function;
- a mapping from the set of predicate symbols to the set of relations in the world; every n-place predicate symbol is assigned an n-place relation.
Example: Constants c1, c2, c3, two-place function symbol plus, two-place predicate symbol gr.

F ≡ gr(plus(c1, c3), c2)

Choose interpretation:

B1: c1 ↦ 1, c2 ↦ 2, c3 ↦ 3, plus ↦ +, gr ↦ >

Thus the formula is mapped to 1 + 3 > 2, resp. 4 > 2 after evaluation.

The greater-than relation G on {1, 2, 3, 4}: G = {(4,3), (4,2), (4,1), (3,2), (3,1), (2,1)}. Because (4, 2) ∈ G, F is true under the interpretation B1.

B2: c1 ↦ 2, c2 ↦ 3, c3 ↦ 1, plus ↦ −, gr ↦ >

We obtain 2 − 1 > 3, resp. 1 > 3. (1, 3) is not a member of G, so F is false under B2.

Obviously, the truth of a formula in PL1 depends on the interpretation.
Definition:

- An atomic formula p(t1, …, tn) is true under the interpretation B if, after interpretation and evaluation of all terms t1, …, tn and interpretation of the predicate p through the n-place relation r, it holds that (B(t1), …, B(tn)) ∈ r.
- The truth of quantifierless formulas follows from the truth of atomic formulas, as in propositional calculus, through the semantics of the logical operators.
- A formula ∀x F is true under the interpretation B exactly when it is true given an arbitrary change of the interpretation for the variable x (and only for x).
- A formula ∃x F is true under the interpretation B exactly when there is an interpretation for x which makes the formula true.

The definitions of semantic equivalence of formulas, of the concepts satisfiable, true, unsatisfiable, and model, along with semantic entailment (Definitions ??, ??, ??), carry over unchanged from propositional calculus to predicate logic.

Theorem: Theorems ?? (deduction theorem) and ?? (proof by contradiction) hold analogously for PL1.
Example: A family tree:

[Figure: family tree. Karen A. and Franz A. are the parents of Oscar A. and Mary B.; Anne A. and Oscar A. are the parents of Henry A., Eve A., and Isabelle A.; Mary B. and Oscar B. are the parents of Clyde B.]

Three-place relation:

Child = { (Oscar A., Karen A., Franz A.), (Mary B., Karen A., Franz A.),
          (Henry A., Anne A., Oscar A.), (Eve A., Anne A., Oscar A.),
          (Isabelle A., Anne A., Oscar A.), (Clyde B., Mary B., Oscar B.) }

One-place relation: Female = {Karen A., Anne A., Mary B., Eve A., Isabelle A.}

Predicate child(x, y, z) with the semantics

B(child(x, y, z)) = t  ≡  (B(x), B(y), B(z)) ∈ Child.

Assignment: B(oscar) = Oscar A., B(eve) = Eve A., B(anne) = Anne A.

child(eve, anne, oscar) is true! Does child(eve, oscar, anne) also hold? See also Exercise ??!

∀x ∀y ∀z  child(x, y, z) ⇔ child(x, z, y)
∀x ∀y  descendant(x, y) ⇔ ∃z child(x, y, z) ∨ (∃u ∃v child(x, u, v) ∧ descendant(u, y))

Knowledge base:

WB ≡ female(karen) ∧ female(anne) ∧ female(mary) ∧ female(eve) ∧ female(isabelle)
   ∧ child(oscar, karen, franz) ∧ child(mary, karen, franz)
   ∧ child(eve, anne, oscar) ∧ child(henry, anne, oscar) ∧ child(isabelle, anne, oscar)
   ∧ child(clyde, mary, oscarb)
   ∧ (∀x ∀y ∀z child(x, y, z) ⇒ child(x, z, y))
   ∧ (∀x ∀y descendant(x, y) ⇔ ∃z child(x, y, z) ∨ (∃u ∃v child(x, u, v) ∧ descendant(u, y))).

Is child(eve, oscar, anne) derivable? Is descendant(eve, franz) derivable?
Equality

Equality axioms:

∀x  x = x                                  (reflexivity)
∀x ∀y  x = y ⇒ y = x                       (symmetry)           (3)
∀x ∀y ∀z  x = y ∧ y = z ⇒ x = z            (transitivity)

For every function symbol f:

∀x ∀y  x = y ⇒ f(x) = f(y)                 (substitution axiom)  (4)

Replacement of terms

Example: In ∀x x = 5 ⇒ x = y, replace y by sin(x):

∀x  x = 5 ⇒ x = sin(x)     wrong!
∀x  x = 5 ⇒ x = sin(z)     correct

Definition: We write ϕ[x/t] for the formula that results when we replace every free occurrence of the variable x in ϕ with the term t. Thereby we do not allow any variables in the term t that are quantified in ϕ. In those cases variables must be renamed to ensure this.

Example: If, in the formula ∀x x = y, the free variable y is naively replaced by the term x + 1, the result is ∀x x = x + 1. With correct substitution (renaming the bound variable first) we obtain, e.g., ∀u u = x + 1, which has a quite different semantics.
Quantifiers and Normal Forms

Definition ??:

∀x p(x) ≡ p(a1) ∧ … ∧ p(an)
∃x p(x) ≡ p(a1) ∨ … ∨ p(an)

for all constants a1, …, an in K.

De Morgan's law for quantifiers:

∀x ϕ ≡ ¬∃x ¬ϕ.

Example: "Everyone wants to be loved" ≡ "Nobody does not want to be loved".
Definition: A predicate logic formula ϕ is in prenex normal form if it holds that

- ϕ = Q1 x1 … Qn xn ψ,
- ψ is a quantifierless formula, and
- Qi ∈ {∀, ∃} for i = 1, …, n.
Caution: rename variables first:

∀x p(x) ⇒ ∃x q(x)   becomes   ∀x p(x) ⇒ ∃y q(y).

Bring the quantifiers to the front:

∀x ∃y  p(x) ⇒ q(y).

How is this done with the following formula?

(∀x p(x)) ⇒ ∃y q(y)     (5)
Example: The convergence of a series (a_n)_{n∈N} to a limit a is defined by

∀ε > 0 ∃n0 ∈ N ∀n > n0  |a_n − a| < ε.

Formalized with abs(x) for |x|, a(n) for a_n, minus(x, y) for x − y, el(x, y) for x ∈ y, and gr(x, y) for x > y:

∀ε (gr(ε, 0) ⇒ ∃n0 (el(n0, N) ⇒ ∀n (gr(n, n0) ⇒ gr(ε, abs(minus(a(n), a)))))).

Eliminating implications:

∀ε (¬gr(ε, 0) ∨ ∃n0 (¬el(n0, N) ∨ ∀n (¬gr(n, n0) ∨ gr(ε, abs(minus(a(n), a)))))).

Quantifiers to the front:

∀ε ∃n0 ∀n (¬gr(ε, 0) ∨ ¬el(n0, N) ∨ ¬gr(n, n0) ∨ gr(ε, abs(minus(a(n), a))))     (6)
Theorem Every predicate logic formula can be transformed into an equivalent formula in prenex normal form.
Skolemization

∀x1 ∀x2 ∃y1 ∀x3 ∃y2  p(f(x1), x2, y1) ∨ q(y1, x3, y2)
→ ∀x1 ∀x2 ∀x3 ∃y2  p(f(x1), x2, g(x1, x2)) ∨ q(g(x1, x2), x3, y2)
→ ∀x1 ∀x2 ∀x3  p(f(x1), x2, g(x1, x2)) ∨ q(g(x1, x2), x3, h(x1, x2, x3))
→ p(f(x1), x2, g(x1, x2)) ∨ q(g(x1, x2), x3, h(x1, x2, x3)).

Applied to Equation (6):

¬gr(ε, 0) ∨ ¬el(n0(ε), N) ∨ ¬gr(n, n0(ε)) ∨ gr(ε, abs(minus(a(n), a))).

By dropping the variable n0, the Skolem function can receive the name n0.
Skolemization:

∀x1 … ∀xn ∃y ϕ  is replaced by  ∀x1 … ∀xn ϕ[y/f(x1, …, xn)],

where f may not already appear in ϕ. In ∃y p(y), y must be replaced by a (Skolem) constant.

- The resulting formula is no longer equivalent to the original formula.
- However, satisfiability remains unchanged.
Program scheme: NormalFormTransformation(Formula)

1. Transformation into prenex normal form, with transformation into conjunctive normal form (Theorem ??):
   - elimination of equivalences,
   - elimination of implications,
   - repeated application of De Morgan's law and the distributive law,
   - renaming of variables if necessary,
   - factoring out universal quantifiers.
2. Skolemization:
   - replacement of existentially quantified variables by new Skolem functions,
   - deletion of the resulting universal quantifiers.

Runtime: step 1: exponential (naive) or polynomial [Ede91]; step 2: polynomial.
The universal logic machine
Natural reasoning

- Gentzen calculus
- Sequent calculus
- Meant to be applied by humans
- Intuitive inference rules
- Works on arbitrary PL1 formulas
Example: Two simple inference rules:

A,  A ⇒ B
—————————   (modus ponens, MP)
    B

 ∀x A
———————   (∀-elimination, ∀E; the variable x is replaced by a ground term t)
A[x/t]

Proof for child(eve, oscar, anne):

WB: 1  child(eve, anne, oscar)
WB: 2  ∀x ∀y ∀z child(x, y, z) ⇒ child(x, z, y)
∀E(2), x/eve, y/anne, z/oscar: 3  child(eve, anne, oscar) ⇒ child(eve, oscar, anne)
MP(1, 3): 4  child(eve, oscar, anne)

This calculus (MP + ∀E) is not complete!
Theorem (Gödel's completeness theorem): First-order predicate logic is complete. That is, there is a calculus with which every proposition ϕ that is a consequence of a knowledge base WB can be proved: if WB |= ϕ, then it holds that WB ⊢ ϕ.

- Every true proposition in first-order predicate logic is provable.
- Is the reverse also true? Is everything we can derive syntactically actually true?

Theorem (correctness): There are calculi with which only true propositions can be proved. That is, if WB ⊢ ϕ holds, then WB |= ϕ.

- Provability and semantic consequence are equivalent concepts, as long as the calculus is correct and complete.
- Calculi of natural deduction are rather unsuited for automation.
Resolution

Resolution proof for Example ??. Query: Q ≡ child(eve, oscar, anne); negated query: ¬Q ≡ ¬child(eve, oscar, anne).

Knowledge base plus negated query in conjunctive normal form:

WB ∧ ¬Q ≡ (child(eve, anne, oscar))1
        ∧ (¬child(x, y, z) ∨ child(x, z, y))2
        ∧ (¬child(eve, oscar, anne))3.

Proof:

(2) with x/eve, y/anne, z/oscar: (¬child(eve, anne, oscar) ∨ child(eve, oscar, anne))4
Res(3, 4): (¬child(eve, anne, oscar))5
Res(1, 5): ()6
Example:

- Everybody knows his own mother.
- Does Henry know anyone?

(knows(x, mother(x)))1 ∧ (¬knows(henry, y))2

Unification with x/henry, y/mother(henry) yields

(knows(henry, mother(henry)))1 ∧ (¬knows(henry, mother(henry)))2.
Definition: Two literals are called unifiable if there is a substitution σ for all variables which makes the literals equal. Such a σ is called a unifier. A unifier is called the most general unifier (MGU) if all other unifiers can be obtained from it by substitution of variables.
Example: We want to unify the literals p(f(g(x)), y, z) and p(u, u, f(u)). Several unifiers are:

σ1:            y/f(g(x)),        z/f(f(g(x))),        u/f(g(x))
σ2: x/h(v),    y/f(g(h(v))),     z/f(f(g(h(v)))),     u/f(g(h(v)))
σ3: x/h(h(v)), y/f(g(h(h(v)))),  z/f(f(g(h(h(v))))),  u/f(g(h(h(v))))
σ4: x/h(a),    y/f(g(h(a))),     z/f(f(g(h(a)))),     u/f(g(h(a)))
σ5: x/a,       y/f(g(a)),        z/f(f(g(a))),        u/f(g(a))

where σ1 is the most general unifier. The other unifiers result from σ1 through the substitutions x/h(v), x/h(h(v)), x/h(a), x/a.
- Predicate symbols can be treated like function symbols, i.e. the literal is treated like a term.
- Arguments of functions are processed sequentially.
- Terms are unified recursively over the term structure.
Complexity

- The simplest unification algorithms are very fast in most cases.
- Worst case: the computation time grows exponentially with the size of the terms.
- Since in practice nearly all unification attempts fail, the worst-case complexity rarely has a dramatic effect.
- The fastest unification algorithms have nearly linear complexity [Bib92].
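The recursive unification over the term structure can be sketched as follows. The term representation (tuples for compound terms, strings for variables) is an assumption for illustration; the occurs check is omitted for brevity.

def is_var(t):
    return isinstance(t, str)

def substitute(t, sigma):
    # apply the substitution sigma to the term t
    if is_var(t):
        return substitute(sigma[t], sigma) if t in sigma else t
    return (t[0],) + tuple(substitute(a, sigma) for a in t[1:])

def unify(s, t, sigma=None):
    # return a most general unifier extending sigma, or None
    sigma = {} if sigma is None else sigma
    s, t = substitute(s, sigma), substitute(t, sigma)
    if s == t:
        return sigma
    if is_var(s):
        sigma[s] = t              # occurs check omitted for brevity
        return sigma
    if is_var(t):
        sigma[t] = s
        return sigma
    if s[0] != t[0] or len(s) != len(t):
        return None               # clash of functors or arities
    for a, b in zip(s[1:], t[1:]):  # unify the arguments recursively
        sigma = unify(a, b, sigma)
        if sigma is None:
            return None
    return sigma

# p(f(g(x)), y, z) and p(u, u, f(u)) from the example above:
lhs = ("p", ("f", ("g", "x")), "y", "z")
rhs = ("p", "u", "u", ("f", "u"))
print(unify(lhs, rhs))  # sigma1: u/f(g(x)), y/f(g(x)), z/f(f(g(x)))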
General resolution rule for predicate logic

Definition: The resolution rule for two clauses in conjunctive normal form reads

(A1 ∨ … ∨ Am ∨ B),  (¬B′ ∨ C1 ∨ … ∨ Cn),  σ(B) = σ(B′)
——————————————————————————————————————————————————————     (7)
(σ(A1) ∨ … ∨ σ(Am) ∨ σ(C1) ∨ … ∨ σ(Cn))

where σ is the MGU of B and B′.
Theorem: The resolution rule is correct. That is, the resolvent is a semantic consequence of the two parent clauses.
Example (Russell's paradox): There is a barber who shaves everyone who does not shave himself. Formalized in PL1:

∀x  shaves(barber, x) ⇔ ¬shaves(x, x)

Clause form (see Exercise ??):

(¬shaves(barber, x) ∨ ¬shaves(x, x))1 ∧ (shaves(barber, x) ∨ shaves(x, x))2     (8)

- No contradiction is derivable by resolution alone!
- Thus: resolution (on its own) is not complete!
Definition: Factorization of a clause is accomplished by

(A1 ∨ A2 ∨ … ∨ An),  σ(A1) = σ(A2)
———————————————————————————————————
(σ(A2) ∨ … ∨ σ(An))

where σ is the MGU of A1 and A2. Now we can derive a contradiction from Equation (8):

Fak(1, σ: x/barber): (¬shaves(barber, barber))3
Fak(2, σ: x/barber): (shaves(barber, barber))4
Res(3, 4): ()5

and we assert:
Theorem: The resolution rule (??) together with the factorization rule (??) is refutation complete. That is, by application of factorization and resolution steps, the empty clause can be derived from any unsatisfiable formula in conjunctive normal form.
Combinatorial Explosion of the Search Space ⇒ Resolution Strategies

Search space reduction by certain strategies:

Unit resolution
- Prioritizes resolution steps in which one of the two clauses consists of only one literal, called a unit clause.
- Complete.
- A heuristic (search space reduction is not guaranteed).

Set of support strategy
- Set of support (SOS): a subset of WB ∧ ¬Q.
- Resolution only between clauses from the SOS and its complement; the resolvent is added to the SOS.
- Search space reduction guaranteed.
- Not complete in general; complete if WB ∧ ¬Q \ SOS is satisfiable.
- Often the negated query ¬Q is used as the initial SOS.
Input resolution
- A clause from the input set WB ∧ ¬Q must be involved in every resolution step.
- Search space reduction guaranteed.
- Not complete.

Pure literal rule
- All clauses that contain literals for which there are no complementary literals in other clauses are deleted.
- Search space reduction guaranteed.
- Complete.
- Used by practically all resolution provers.
Subsumption
- If the literals of a clause K1 are a subset of the literals of a clause K2, then K2 can be deleted. For example, the clause (raining(today) ⇒ street_wet(today)) is redundant if street_wet(today) is already valid.
- Search space reduction guaranteed.
- Complete.
- Used by practically all resolution provers.
Equality

Equality axioms:

∀x  x = x
∀x ∀y  x = y ⇒ y = x
∀x ∀y ∀z  x = y ∧ y = z ⇒ x = z

Solution: special inference rules for equality.

Demodulation: an equation t1 = t2 is applied by means of unification to a term t as follows:

t1 = t2,  (… t …),  σ(t1) = σ(t)
—————————————————————————————————
(… σ(t2) …)

Paramodulation works with conditional equations [Bib92; BB92].

An equation t1 = t2 can be applied in both directions: substitution of t1 by t2, or of t2 by t1. Solution:
- Equations are often used in one direction only ⇒ directed equations
- Term rewriting systems
- Special equality provers
Automated Theorem Provers

Otter, 1984 [Kal01]
- Successful resolution prover (with equality handling)
- L. Wos, W. McCune: Argonne National Laboratory, Chicago

"Currently, the main application of Otter is research in abstract algebra and formal logic. Otter and its predecessors have been used to answer many open questions in the areas of finite semigroups, ternary Boolean algebra, logic calculi, combinatory logic, group theory, lattice theory, and algebraic geometry."
SETHEO, 1987 [Let+92]
- PROLOG technology (Warren Abstract Machine)
- W. Bibel, J. Schumann, R. Letz: Munich Technical University
- PARTHEO: implementation on parallel computers
E, 2000 [Sch02]
- Modern equality prover
- S. Schulz: Munich Technical University

From the homepage of E: "E is a purely equational theorem prover for clausal logic. That means it is a program that you can stuff a mathematical specification (in clausal logic with equality) and a hypothesis into, and which will then run forever, using up all of your machine's resources. Very occasionally it will find a proof for the hypothesis and tell you so ;-)."
Vampire
- Resolution with equality handling
- A. Voronkov: University of Manchester, England
- Winner of CADE-20, 2005

Isabelle [NPW02]
- Interactive prover for higher-order predicate logic
- T. Nipkow, L. Paulson, M. Wenzel: University of Cambridge, Munich Technical University
Mathematical Examples

Application of E [Sch02] to: left- and right-neutral elements in a semigroup are equal.

Definition: A structure (M, ·) consisting of a set M with a two-place inner operation · is called a semigroup if the law of associativity

∀x ∀y ∀z  (x · y) · z = x · (y · z)

holds. An element e ∈ M is called left-neutral (right-neutral) if ∀x e · x = x (∀x x · e = x).
It has to be shown:

Theorem: If a semigroup has a left-neutral element el and a right-neutral element er, then el = er.
Proof, variant 1: intuitive mathematical reasoning.

Clearly it holds for all x ∈ M that

el · x = x     (9)

and

x · er = x.     (10)

If we set x = er in Equation (9) and x = el in Equation (10), we obtain the two equations el · er = er and el · er = el. Thus

el = el · er = er.
Proof, variant 2: a manual resolution proof.

(¬ el = er)1                         negated query
(m(m(x, y), z) = m(x, m(y, z)))2     associativity
(m(el, x) = x)3                      left-neutral element
(m(x, er) = x)4                      right-neutral element

Equality axioms:

(x = x)5                             (reflexivity)
(¬ x = y ∨ y = x)6                   (symmetry)
(¬ x = y ∨ ¬ y = z ∨ x = z)7         (transitivity)
(¬ x = y ∨ m(x, z) = m(y, z))8       (substitution in m)
(¬ x = y ∨ m(z, x) = m(z, y))9       (substitution in m)

Proof:

Res(3, 6, x6/m(el, x3), y6/x3):      (x = m(el, x))10
Res(7, 10, x7/x10, y7/m(el, x10)):   (¬ m(el, x) = z ∨ x = z)11
Res(4, 11, x4/el, x11/er, z11/el):   (er = el)12
Res(1, 12, ∅):                       ()
Proof, variant 3: automated resolution proof with the prover E.

Transformation into the clause normal form language LOP: a clause

(¬A1 ∨ … ∨ ¬Am ∨ B1 ∨ … ∨ Bn)

is written as B1;…;Bn <- A1,…,Am.
(Tail of a CLP(FD) room-assignment program; the beginning of the listing is missing.)

    ... >= 2,                    % distance Mueller/Schmid >= 2
    Huber #= Mathe,              % Huber examines mathematics
    Physik #= 4,                 % physics is in room 4
    Deutsch #\= 1,               % German is not in room 1
    Englisch #\= 1,              % English is not in room 1
    nl, write([Maier, Huber, Mueller, Schmid]), nl,
    write([Deutsch, Englisch, Mathe, Physik]), nl.
Output:
[3,1,2,4]
[2,3,1,4]

Room plan:

Room no.   1        2        3         4
Teacher    Hoover   Miller   Mayer     Smith
Subject    Math     German   English   Physics

Finite domain constraint solver. Exercise: the Einstein puzzle.
Summary

- Unification, lists, declarative programming
- Relational view of procedures
- Parameters serve for input and output
- Short programs
- A tool for rapid prototyping
- CLP for optimization and planning tasks and for logic puzzles
- PROLOG in Europe, LISP in the USA

Literature: [Bra86] and [CM94]; handbooks: [Wie04; Dia04]; CLP: [Bar98]
Part VI

Search, Games and Problem Solving: Introduction · Uninformed Search · Heuristic Search · Games with Opponents · State of the Art
[Figure: a heavily trimmed search tree. Or: where is my cat?]

A Search Tree

[Figure: search tree for a simple SLD resolution proof (depth bound 14).]
Example: Chess

- Branching factor b = 30, depth d = 50: 30^50 ≈ 7.2 · 10^73 leaves.
- Number of inference steps: Σ_{d=0}^{50} 30^d = (1 − 30^51)/(1 − 30) ≈ 7.4 · 10^73.
- 10000 computers, each CPU performing one billion inferences per second, parallelization without loss.
- Computation time:

  7.4 · 10^73 inferences / (10000 · 10^9 inferences/sec) = 7.4 · 10^60 sec ≈ 2.3 · 10^53 years

- ≈ 10^43 times the age of the universe.

Questions:
- Why do good chess players exist, and nowadays also good chess computers?
- Why do mathematicians find proofs for propositions in which the search space is even larger?
The 8-Puzzle

[Figure: a start state of the 8-puzzle, the goal state (1 2 3 / 4 5 6 / 7 8 _), and the search tree of the first moves.]

Average branching factor = √8 ≈ 2.83
Definition: The average branching factor of a tree is defined as the branching factor that a tree with constant branching factor, equal depth, and an equal number of leaf nodes would have (see Exercise ??).
[Figure: the same search tree, where moves that merely undo the preceding move are omitted.]

Average branching factor ≈ 1.89 (9)

9 For an 8-puzzle the average branching factor depends on the start state.
Definition: A search problem is defined by the following values:

State: description of the state of the world in which the agent finds itself.
Starting state: the initial state in which the search agent is started.
Goal state: if the agent reaches a goal state, it terminates and outputs a solution (if desired).
Actions: all of the agent's allowed actions.
Solution: the path in the search tree from the starting state to the goal.
Cost function: assigns a cost value to every action; necessary for finding a cost-optimal solution.
State space: the set of all states.
Search tree: the states are the nodes and the actions are the edges.
Applied to the 8-puzzle, we get:

State: a 3 × 3 matrix S with the values 1, 2, 3, 4, 5, 6, 7, 8 (each occurring once) and one empty square.
Starting state: an arbitrary state.
Goal state: an arbitrary state, e.g. the goal state shown in Figure ??.
Actions: movements of the empty square Sij to the left (if j ≠ 1), right (if j ≠ 3), up (if i ≠ 1), down (if i ≠ 3).
Cost function: the constant function 1, since all actions have equal cost.
State space: the state space decomposes into domains that are mutually unreachable (Exercise ??). Thus there are unsolvable 8-puzzle problems.
For the analysis of search algorithms, the following terms are needed:

Definition:
- The number of successor states of a state s is called the branching factor b(s), or b if the branching factor is constant.
- The effective branching factor of a tree of depth d with n total nodes is defined as the branching factor that a tree with constant branching factor, equal depth, and equal n would have (see Exercise ??).
- A search algorithm is called complete if it finds a solution for every solvable problem. If a complete search algorithm terminates without finding a solution, then the problem is unsolvable.

Solving the equation

n = Σ_{i=0}^{d} b^i = (b^{d+1} − 1)/(b − 1)

for b yields the effective branching factor.
Theorem: For heavily branching finite search trees with a large constant branching factor, almost all nodes are on the last level.

Proof: Exercise ??.
Example: Shortest path from a city A to a city B.

[Figure: the graph of southern Germany with cost function. Nodes: Frankfurt, Würzburg, Bayreuth, Mannheim, Karlsruhe, Stuttgart, Ulm, Nürnberg, München, Passau, Rosenheim, Salzburg, Linz, Memmingen, Basel, Bern, Zürich, Landeck, Innsbruck; the edge labels give the distances in km.]
State: a city as the current location of the traveler.
Starting state: an arbitrary city.
Goal state: an arbitrary city.
Actions: travel from the current city to a neighboring city.
Cost function: the distance between the cities.
State space: all cities, i.e. the nodes of the graph.
Definition: A search algorithm is called optimal if, whenever a solution exists, it finds the solution with the lowest cost.

The 8-puzzle problem is
- deterministic: every action leads from a state to a unique successor state;
- observable: the agent always knows which state it is in.

Under these conditions, so-called offline algorithms can find optimal solutions. Otherwise: reinforcement learning.
Uninformed Search

Breadth-first search

[Figure: a breadth-first search tree; the nodes are numbered 1-30 in the order in which they are expanded, level by level.]

Breadth-first-search(NodeList, Goal)
  NewNodes = ∅
  For all Node ∈ NodeList
    If GoalReached(Node, Goal) Return("Solution found", Node)
    NewNodes = Append(NewNodes, Successors(Node))
  If NewNodes ≠ ∅
    Return(Breadth-first-search(NewNodes, Goal))
  Else Return("No solution")

- The algorithm is generic.
- Application-specific functions: GoalReached and Successors.
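A direct Python rendering of the pseudocode above (a sketch; GoalReached and Successors remain the application-specific parts, and, as in the pseudocode, no duplicate elimination is done):

def breadth_first_search(node_list, goal, successors, goal_reached):
    new_nodes = []
    for node in node_list:
        if goal_reached(node, goal):
            return node                     # solution found
        new_nodes.extend(successors(node))
    if new_nodes:
        return breadth_first_search(new_nodes, goal, successors, goal_reached)
    return None                             # no solution

# Tiny example: search the integers, doubling or adding 1, from 1 to 10.
print(breadth_first_search([1], 10,
                           successors=lambda n: [2 * n, n + 1],
                           goal_reached=lambda n, g: n == g))  # 10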
Analysis

- Complete.
- Optimal if the costs of all actions are the same (see Exercise).
- Computation time:

  c · Σ_{i=0}^{d} b^i = c · (b^{d+1} − 1)/(b − 1) = O(b^d)

- Memory space = O(b^d).
- Uniform cost search: the node with the lowest cost from the ascendingly sorted node list is always expanded. Always optimal!
Depth-first search

[Figure: a depth-first search tree; the nodes are numbered in the order in which they are visited.]

Depth-first-search(Node, Goal)
  If GoalReached(Node, Goal) Return("Solution found")
  NewNodes = Successors(Node)
  While NewNodes ≠ ∅
    Result = Depth-first-search(First(NewNodes), Goal)
    If Result = "Solution found" Return("Solution found")
    NewNodes = Rest(NewNodes)
  Return("No solution")

The algorithm for depth-first search.
Analysis

- Not complete.
- Not optimal.
- Computation time = O(b^d)
- Memory requirement = O(b · d)
Iterative Deepening

[Figure: iterative deepening performs repeated depth-first searches with depth limits 1, 2, 3, …]

IterativeDeepening(Node, Goal)
  DepthLimit = 0
  Repeat
    Result = DepthFirstSearch-B(Node, Goal, 0, DepthLimit)
    DepthLimit = DepthLimit + 1
  Until Result = "Solution found"

Depth-first search with bound:

DepthFirstSearch-B(Node, Goal, Depth, Limit)
  If GoalReached(Node, Goal) Return("Solution found")
  NewNodes = Successors(Node)
  While NewNodes ≠ ∅ And Depth < Limit
    Result = DepthFirstSearch-B(First(NewNodes), Goal, Depth + 1, Limit)
    If Result = "Solution found" Return("Solution found")
    NewNodes = Rest(NewNodes)
  Return("No solution found")

Analysis

- Complete.
- Optimal if the costs are constant and the depth increment is 1.
- Computation time = O(b^d)
- Memory requirement = O(b · d)
Loss from repeated computations:

N_b(d_max) = Σ_{i=0}^{d_max} b^i = (b^{d_max+1} − 1)/(b − 1)

Σ_{d=1}^{d_max−1} N_b(d) = Σ_{d=1}^{d_max−1} (b^{d+1} − 1)/(b − 1)
                        = (1/(b−1)) · (Σ_{d=2}^{d_max} b^d − d_max + 1)
                        ≈ (1/(b−1)) · (b^{d_max+1} − 1)/(b − 1)
                        = N_b(d_max)/(b − 1)

For b = 20 the first d_max − 1 trees together contain only about 1/(b − 1) = 1/19 of the number of nodes in the last tree.
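The ratio can be checked numerically in a few lines of Python (assuming, as above, a constant branching factor b):

def N(b, d):                       # number of nodes of a tree of depth d
    return (b ** (d + 1) - 1) // (b - 1)

b, d_max = 20, 6
overhead = sum(N(b, d) for d in range(1, d_max)) / N(b, d_max)
print(overhead, 1 / (b - 1))       # both roughly 0.0526 = 1/19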
Comparison

                    Breadth-first  Uniform cost  Depth-first  Iterative deepening
Completeness        yes            yes           no           yes
Optimal solution    yes (*)        yes           no           yes (*)
Computation time    b^d            b^d           ∞ or b^ds    b^d
Memory use          b^d            b^d           b·d          b·d

(*): only true with constant action cost. ds is the maximal depth of a finite search tree.
Heuristic Search

- Heuristics are problem-solving strategies which in many cases find a solution faster than uninformed search.
- There is no guarantee!
- In everyday life, heuristic methods are important.
- Real-time decisions under limited resources: a good solution found quickly is preferred over a solution that is optimal but very expensive to derive.
Mathematical modeling

- Heuristic evaluation function f(s) for states
- Node = state + heuristic evaluation + …
The algorithm:

HeuristicSearch(Start, Goal)
  NodeList = [Start]
  While True
    If NodeList = ∅ Return("No solution")
    Node = First(NodeList)
    NodeList = Rest(NodeList)
    If GoalReached(Node, Goal) Return("Solution found", Node)
    NodeList = SortIn(Successors(Node), NodeList) (10)

Depth-first and breadth-first search are special cases of the function HeuristicSearch!

10 When sorting a new node into the node list, it may be advantageous to check whether the node is already present and, if so, to delete the duplicate.
Remarks

- The ideal heuristic would be a function that calculates the actual costs from each node to the goal.
- In the real world we use a cost estimate function h.

[Cartoon] He: "Dear, think of the fuel costs! I'll pluck one for you somewhere else." She: "No, I want that one over there!"
Greedy Search

[Figure: the graph of southern Germany with cost function, as above.]

Cost estimate function h(s) = flying distance from city s to Ulm:

Basel      204    Landeck     143    Passau      257
Bayreuth   207    Linz        318    Rosenheim   168
Bern       247    München     120    Stuttgart    75
Frankfurt  215    Mannheim    164    Salzburg    236
Innsbruck  163    Memmingen    47    Würzburg    153
Karlsruhe  137    Nürnberg    132    Zürich      157
[Figure: greedy search trees for the path from Linz to Ulm and from Mannheim to Ulm; each node is labeled with its heuristic value h.]

- Mannheim-Nürnberg-Ulm: 401 km
- Mannheim-Karlsruhe-Stuttgart-Ulm: 323 km
A*-Search

Cost function: g(s) = sum of accrued costs from the root to the current node.
Heuristic cost estimate: h(s) = estimated cost from the current node to the goal.
Heuristic evaluation function: f(s) = g(s) + h(s).

Requirement:

Definition: A heuristic cost estimate function h(s) that never overestimates the actual cost from state s to the goal is called admissible.

A*-algorithm = HeuristicSearch with evaluation function f(s) = g(s) + h(s) and admissible heuristic h.
[Figure: two snapshots of the A* search tree for the path from Frankfurt to Ulm. In the boxes below the city names we show g(s), h(s), f(s); numbers in parentheses after the city names show the order in which the nodes were generated. The goal Ulm is reached with f = 323 via Frankfurt, Mannheim, Karlsruhe, Stuttgart.]
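The A* algorithm itself fits in a few lines of Python. The sketch below uses a heap as the sorted node list; the mini-graph and heuristic values are a small assumed fragment in the spirit of the Frankfurt-Ulm example, not the complete map.

import heapq

def a_star(start, goal, neighbors, h):
    # frontier entries: (f, g, state, path); the heap keeps them sorted by f
    frontier = [(h(start), 0, start, [start])]
    best_g = {start: 0}
    while frontier:
        f, g, state, path = heapq.heappop(frontier)
        if state == goal:
            return g, path
        for succ, cost in neighbors(state):
            g2 = g + cost
            if g2 < best_g.get(succ, float("inf")):
                best_g[succ] = g2
                heapq.heappush(frontier, (g2 + h(succ), g2, succ, path + [succ]))
    return None

# Assumed mini-graph (a fragment of the map example):
graph = {"Frankfurt": [("Mannheim", 85), ("Wuerzburg", 111)],
         "Mannheim": [("Karlsruhe", 67), ("Nuernberg", 230)],
         "Karlsruhe": [("Stuttgart", 64)],
         "Stuttgart": [("Ulm", 107)],
         "Nuernberg": [("Ulm", 171)]}
h = {"Frankfurt": 215, "Mannheim": 164, "Karlsruhe": 137, "Stuttgart": 75,
     "Nuernberg": 132, "Wuerzburg": 153, "Ulm": 0}
print(a_star("Frankfurt", "Ulm",
             lambda s: graph.get(s, []), lambda s: h[s]))
# (323, ['Frankfurt', 'Mannheim', 'Karlsruhe', 'Stuttgart', 'Ulm'])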
Theorem: The A* algorithm is optimal. That is, it always finds the solution with the lowest total cost if the heuristic h is admissible.

Proof: [Figure: the first solution node l found, a frontier node s, and an alternative solution node l′ below s, with f(l) ≤ f(s) and f(s) ≤ g(l′).]

The first solution node l found by A* never has a higher cost than another arbitrary solution node l′:

g(l) = g(l) + h(l) = f(l) ≤ f(s) = g(s) + h(s) ≤ g(l′).
IDA*-Search

Weaknesses of A*:
- High memory requirements.
- The list of open nodes must be kept sorted ⇒ heapsort.

Solution: iterative deepening
- The same as depth-first search, but with a limit on the heuristic evaluation f(s).
Route Planning with A*-Search (11)

- Simple heuristic: air line distance.
- Better: landmarks, 5 to 60 randomly selected cities.
- Preprocessing: for all landmarks the shortest paths to all nodes are stored.

Let l be a landmark, s the current node, z the goal node, and c*(x, y) the cost of the shortest path from x to y. Triangle inequality:

c*(s, l) ≤ c*(s, z) + c*(z, l).

Solving for c*(s, z) yields

h(s) = c*(s, l) − c*(z, l) ≤ c*(s, z).

h(s) is admissible.

11 A. Batzill. Optimal Route Planning on Mobile Systems. Masterarbeit, Hochschule Ravensburg-Weingarten, 2016.
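A sketch of the landmark heuristic in Python, assuming the shortest-path costs c*(·, l) to each landmark have been precomputed (e.g. with Dijkstra's algorithm) and stored in dictionaries; the numbers below are an assumed toy example, not data from the thesis.

def landmark_h(dist_to_landmark, goal):
    # heuristic h(s) for a fixed goal z, given the table c*(., l)
    def h(s):
        return max(0, dist_to_landmark[s] - dist_to_landmark[goal])
    return h

def multi_landmark_h(tables, goal):
    # with several landmarks, the maximum is still admissible
    hs = [landmark_h(t, goal) for t in tables]
    return lambda s: max(h(s) for h in hs)

dists = {"A": 0, "B": 5, "C": 9}   # assumed c*(., l) for one landmark l
h = landmark_h(dists, goal="B")
print(h("C"))  # 4, a lower bound on the true cost from C to B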
[Figure: A* search trees with no heuristic (red), air line distance (dark green), and the landmark heuristic (blue).]

Results (12)

                      unidirectional             bidirectional
                      tree size   run time       tree size   run time
                      [nodes]     [msec]         [nodes]     [msec]
no heuristic          62000       192            41850       122
air line distance      9380        86            12193        84
landmark heuristic     5260        16             7290        16

Advantages of the landmark heuristic:
- Smaller search space than the air line heuristic.
- Fast computation: only a table lookup.

12 A. Batzill. Optimal Route Planning on Mobile Systems. Masterarbeit, Hochschule Ravensburg-Weingarten, 2016.
Optimization of time, not distance

- d(s, z) is replaced by t(s, z) = d(s, z)/vmax
- the air line heuristic estimate becomes worse
- the landmark heuristic is still very good

Further improvement

- contraction hierarchies (13)

13 R. Geisberger et al. Contraction hierarchies: Faster and simpler hierarchical routing in road networks. In: Experimental Algorithms. Springer, 2008, pp. 319-333.
Comparison of search algorithms

Admissible heuristics for the 8-puzzle:
- h1 counts the number of squares that are not in the right place
- h2 measures the Manhattan distance of the squares to their goal positions

Example:

[Figure: an example 8-puzzle state and the goal state 1 2 3 / 4 5 6 / 7 8 _]

For the distance between the two states we obtain

h1(s) = 7,    h2(s) = 1 + 1 + 1 + 1 + 2 + 0 + 3 + 1 = 10.

h1 and h2 are admissible!
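Both heuristics are easy to implement. The following Python functions are our own sketch; encoding a state as a tuple read row by row, with 0 for the blank, is an assumption, and the example state is arbitrary (the board in the slide's figure is not reproduced here).

def h1(state, goal):
    # Number of tiles that are not on their goal square (blank ignored).
    return sum(1 for s, g in zip(state, goal) if s != 0 and s != g)

def h2(state, goal):
    # Sum of the Manhattan distances of all tiles to their goal squares.
    dist = 0
    for pos, tile in enumerate(state):
        if tile == 0:
            continue
        gpos = goal.index(tile)
        dist += abs(pos // 3 - gpos // 3) + abs(pos % 3 - gpos % 3)
    return dist

goal = (1, 2, 3, 4, 5, 6, 7, 8, 0)    # 0 denotes the blank
state = (8, 1, 3, 4, 0, 2, 7, 6, 5)   # an arbitrary example state
print(h1(state, goal), h2(state, goal))   # -> 5 10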
Comparison of search algorithms

- implemented in Mathematica
- averaged over 132 randomly generated 8-puzzle problems

        Iterative deepening     heuristic h1           heuristic h2          Num.
Depth   Steps     Time [sec]    Steps     Time [sec]   Steps     Time [sec]  runs
  2          20     0.003           3.0     0.0010         3.0     0.0010     10
  4          81     0.013           5.2     0.0015         5.0     0.0022     24
  6         806     0.13           10.2     0.0034         8.3     0.0039     19
  8        6455     1.0            17.3     0.0060        12.2     0.0063     14
 10       50512     7.9            48.1     0.018         22.1     0.011      15
 12      486751    75.7           162.2     0.074         56.0     0.031      12
 14           −      −          10079.2     2.6          855.6     0.25       16
 16           −      −          69386.6    19.0         3806.5     1.3       13
 18           −      −         708780.0   161.6        53941.5    14.1        4

(Depths 2-12: A*-algorithm; depths 14-18: IDA*.)

Effective branching factor:
- uninformed search: 2.8
- heuristic h1: 1.5
- heuristic h2: 1.3
Summary

- For uninformed search, only iterative deepening is practically useful. Why?
- IDA* is complete, fast and memory efficient.
- Good heuristics greatly reduce the effective branching factor.
- If the problem is unsolvable, what good does a heuristic do?

How to find good heuristics?
- manually, e.g. by simplification of the problem
- automatic generation of heuristics by machine-learning techniques
Games with Opponents

- games for two players, e.g. chess, checkers, Othello, Go
- deterministic, observable
- card games: only partially observable. Why?
- zero-sum games: win + loss = 0

Minimax Search

Characteristics of games:
- for chess, the average branching factor is around 30 to 35
- 50 moves per player: 30^100 ≈ 10^148 leaf nodes
- real-time requirements ⇒ limited search depth, heuristic evaluation function B(s)
- player: Max, opponent: Min
- assumption: the opponent Min always makes the best move he can
- Max maximizes the evaluation of his moves, Min minimizes the evaluation of his moves

[Figure: a minimax game tree with look-ahead of 4 half-moves. The leaf evaluations are propagated upward: Min nodes take the minimum, Max nodes the maximum of their children.]
Alpha-Beta Pruning

[Figure: an alpha-beta game tree with look-ahead of 4 half-moves; pruned subtrees are marked with bounds such as ≤ 1 and ≤ 2]

- At every leaf node the evaluation function is calculated.
- For every maximum node the current largest child value is saved in α.
- For every minimum node the current smallest child value is saved in β.
- If at a minimum node k the current value β ≤ α, then the search below k can stop. Here α is the largest value of a maximum node in the path from the root to k.
- If at a maximum node l the current value α ≥ β, then the search below l can stop. Here β is the smallest value of a minimum node in the path from the root to l.
AlphaBetaMax(Node, α, β)
  If DepthLimitReached(Node) Then Return(Rating(Node))
  NewNodes = Successors(Node)
  While NewNodes ≠ ∅
    α = Maximum(α, AlphaBetaMin(First(NewNodes), α, β))
    If α ≥ β Then Return(β)
    NewNodes = Rest(NewNodes)
  Return(α)

AlphaBetaMin(Node, α, β)
  If DepthLimitReached(Node) Then Return(Rating(Node))
  NewNodes = Successors(Node)
  While NewNodes ≠ ∅
    β = Minimum(β, AlphaBetaMax(First(NewNodes), α, β))
    If β ≤ α Then Return(α)
    NewNodes = Rest(NewNodes)
  Return(β)

- The algorithm is an extension of depth-first search with two functions for maximum and minimum nodes which call each other mutually.
- It uses the values defined above for α and β.
Computation time

Complexity heavily depends on the order in which child nodes are traversed:
- Worst case: alpha-beta pruning offers no advantage, i.e. nd = b^d.
- Best case: successors of maximum nodes are sorted in descending order, successors of minimum nodes in ascending order. The effective branching factor is then ≈ √b and nd = (√b)^d = b^(d/2). For chess, the branching factor reduces from 35 to about 6; the search horizon is doubled.
- Average case: branching factor ≈ b^(3/4). For chess, the branching factor reduces from 35 to about 14 ⇒ 8 half-moves of look-ahead instead of 6.
- Heuristic node ordering: branching factor ≈ 7 to 8.

Non-deterministic Games

- e.g. dice games
- move sequence: Max, dice, Min, dice, ...
- average the evaluations over all possible rolls
Heuristic Evaluation Functions

- list of relevant features or attributes
- linear evaluation function B(s):

B(s) = a1 · material + a2 · pawn_structure + a3 · king_safety
     + a4 · knight_in_center + a5 · bishop_diagonal_coverage + ...   (11)

with, for example,

material = material(own_team) − material(opponent)
material(team) = num_pawns(team) · 100 + num_knights(team) · 300
               + num_bishops(team) · 300 + num_rooks(team) · 500
               + num_queens(team) · 900 + ...

The weights ai are set intuitively after discussion with experts; better: optimize the weights by machine-learning methods.
Learning of Heuristics

- the expert is only asked for the relevant features f1(s), ..., fn(s)
- a machine-learning process is used to find an optimal evaluation function B(f1, ..., fn)
- evaluation only at the end of the game, e.g. victory, defeat, or draw
- the learning algorithm updates the evaluation function accordingly

Problem: credit assignment
- no ratings for individual moves
- positive or negative feedback only at the end
- how to give feedback for actions in the past?

Young field: reinforcement learning (see Sec. ).

Most of the world's best chess computers still work without machine-learning techniques. Reasons:
- reinforcement learning requires very long computation times
- manually created heuristics are already heavily optimized
Latest research I

Definition by Elaine Rich: Artificial Intelligence is the study of how to make computers do things at which, at the moment, people are better. Games allow a direct comparison of computer and human.

1950 Claude Shannon, Konrad Zuse and John von Neumann design the first chess programs.
1955 Arthur Samuel (IBM) builds a program that learns to play checkers on an IBM 701, trained on archived games in which every individual move had been rated by experts; later the program plays against itself. Credit assignment: for each individual position during a game it compares the evaluation by the function B(s) with the one calculated by alpha-beta pruning and changes B(s) accordingly.

Latest research II

1961 Samuel's checkers program beats the fourth-best checkers player of the USA.
1990 Tesauro: reinforcement learning; his learning backgammon program TD-Gammon plays at world-champion level (see Sec. ).
1997 IBM's Deep Blue defeats the chess world champion Garry Kasparov with a score of 3.5 games to 2.5. Deep Blue computes on average 12 half-moves ahead with alpha-beta pruning.
2004 Hydra: a chess computer on a parallel machine with 64 Xeon processors (3 GHz and 1 GByte of memory each). Software: Ch. Donninger (Austria) and U. Lorenz, Ch. Lutz (Germany).

Latest research III

Position evaluation on an FPGA co-processor (Field Programmable Gate Arrays), evaluating 200 million positions per second. Hydra computes on average about 18 half-moves ahead and often makes moves which grandmasters cannot comprehend, but which in the end lead to victory. It uses alpha-beta search with relatively general, well-known heuristics and a good hand-coded position evaluation; Hydra works without learning.

2009 Pocket Fritz 4, running on a PDA, wins the Copa Mercosur chess tournament in Buenos Aires with 9 wins and 1 draw against 10 excellent human chess players, three of them grandmasters. Its search engine HIARCS 13 (14) searches fewer than 20,000 positions per second.

Latest research IV

Pocket Fritz 4 is about 10,000 times slower than Hydra.

14 HIARCS = Higher Intelligence Auto Response Chess System.
Today's challenge: Go

- square board with 361 squares, 181 white and 180 black stones
- average branching factor: about 300
- after 4 half-moves: 8 · 10^9 positions
- classical game tree search processes have no chance!
- pattern recognition on the board is needed
- humans are still miles ahead of computer programs
Teil VII

Reasoning with Uncertainty
- Reasoning with Uncertainty
- The Maximum Entropy Method
- LEXMED, a medical expert system for the diagnosis of appendicitis
- Reasoning with Bayesian Networks
- Summary
Flying Penguin

1. Tweety is a penguin
2. Penguins are birds
3. All birds can fly

Formalized in PL1, the knowledge base WB consists of:

penguin(tweety)
penguin(x) ⇒ bird(x)
bird(x) ⇒ fly(x)

It can be derived: fly(tweety) (15)

15 See Sec. 

The flying penguin

New trial: penguins cannot fly:

penguin(x) ⇒ ¬fly(x)

It can be derived: ¬fly(tweety). But it can also still be derived: fly(tweety).

- The knowledge base is inconsistent.
- The logic is monotonic: new knowledge cannot invalidate old knowledge.
Probabilistic logic

- Uncertainty: 99% of all birds can fly.
- Incompleteness: the agent has incomplete information about the state of the world (real-time decisions).

Heuristic search ⇒ reasoning with uncertain or incomplete knowledge: "Let's just sit back and think about what to do!"

Example: If a patient experiences pain in the right lower abdomen and a raised white blood cell (leukocyte) count, this raises the suspicion that it might be appendicitis. Formalized:

Stomach pain right lower ∧ Leukocytes > 10000 → Appendicitis

From "Stomach pain right lower ∧ Leukocytes > 10000" we can then use modus ponens to derive Appendicitis.
MYCIN

- 1976, Shortliffe and Buchanan
- certainty factors represent the certainty of facts and rules
- a rule A →β B is read via the conditional probability β

Example:

Stomach pain right lower ∧ Leukocytes > 10000 →0.6 Appendicitis

- formulas for combining the factors of rules
- the calculus is incorrect
- inconsistent results could be derived
Other formalisms for modeling uncertainty

- non-monotonic logics
- default logic
- Dempster-Shafer theory: assigns a belief function Bel(A) to a logical term A
- fuzzy logic (⇒ control theory)

Reasoning with conditional probabilities

- conditional probabilities instead of (material) implication
- subjective probabilities
- probability theory is well-founded
- reasoning with uncertain and incomplete knowledge
- maximum entropy method (MaxEnt)
- Bayesian networks
Calculating with Probabilities

Example:
- In dice games, the probability of throwing a six is 1/6.
- The probability of throwing an odd number is 1/2.

Definition: Let Ω be the set of possible outcomes of an experiment. Each ω ∈ Ω stands for a possible outcome. If the ωi ∈ Ω exclude each other but cover all possible outcomes, they are called elementary events.

- Throwing a die once: Ω = {1, 2, 3, 4, 5, 6}
- Throwing an even number, {2, 4, 6}, is not an elementary event.
- Throwing a number smaller than 5, {1, 2, 3, 4}, is not an elementary event.
- Reason: {2, 4, 6} ∩ {1, 2, 3, 4} = {2, 4} ≠ ∅.
- With two events A and B, A ∪ B is also an event.
- Ω is the sure event.
- The empty set ∅ is the impossible event.
- Instead of A ∩ B we write A ∧ B, because x ∈ A ∩ B ⇔ x ∈ A ∧ x ∈ B.
Notation:

Set notation   Propositional logic   Description
A ∩ B          A ∧ B                 intersection / and
A ∪ B          A ∨ B                 union / or
Ā              ¬A                    complement / negation
Ω              w                     sure event / true
∅              f                     impossible event / false

- A, B, etc. are random variables.
- We consider only discrete random variables with finite value ranges.
- When throwing dice, the number thrown is discrete with the values 1, 2, 3, 4, 5, 6.
- The probability of throwing a 5 or 6 is 1/3:

P(number ∈ {5, 6}) = P(number = 5 ∨ number = 6) = 1/3.
Definition: Let Ω = {ω1, ω2, ..., ωn} be finite. No elementary event is preferred, that is, we assume a symmetry regarding the frequency of occurrence of all elementary events. The probability P(A) of the event A is then defined by

P(A) = |A| / |Ω| = (number of outcomes favourable to A) / (number of possible outcomes).

Example: When throwing a die, the probability of an even number is

P(number ∈ {2, 4, 6}) = |{2, 4, 6}| / |{1, 2, 3, 4, 5, 6}| = 3/6 = 1/2.

- Every elementary event has the probability 1/|Ω| (Laplace assumption).
- Applicable only to finite event sets.
- Example: eye color with the values green, blue, brown; "eye color = blue" describes the value of a variable.
- Binary (boolean) variables are propositions themselves, e.g. P(JohnCalls) instead of P(JohnCalls = t).
From this definition, some rules follow directly:

Theorem
1. P(Ω) = 1.
2. P(∅) = 0, i.e. the impossible event has probability 0.
3. For mutually exclusive events A and B: P(A ∨ B) = P(A) + P(B).
4. For two complementary events A and ¬A: P(A) + P(¬A) = 1.
5. For arbitrary events A and B: P(A ∨ B) = P(A) + P(B) − P(A ∧ B).
6. For A ⊆ B: P(A) ≤ P(B).
7. If A1, ..., An are the elementary events, then Σ_{i=1}^{n} P(Ai) = 1.

Proof as exercise.
Joint probability distribution for A, B (with P(A, B) = P(A ∧ B)):

P(A, B) = (P(A, B), P(A, ¬B), P(¬A, B), P(¬A, ¬B))

Distribution in matrix form:

P(A, B)   B = w       B = f
A = w     P(A, B)     P(A, ¬B)
A = f     P(¬A, B)    P(¬A, ¬B)

Joint probability distribution

- d variables X1, ..., Xd with n values each
- the distribution contains the values P(X1 = x1, ..., Xd = xd)
- x1, ..., xd each may take n different values
- this creates a d-dimensional matrix with n^d elements
- one of the n^d values is redundant
- the distribution is characterized uniquely by n^d − 1 values
Conditional probabilities

Example: In the Doggenriedstraße in Weingarten, the velocities of 100 vehicles are measured.

Event                                               Frequency   Relative freq.
Vehicle observed                                      100          1
Driver is a student (S)                                30          0.3
Velocity too high (V)                                  10          0.1
Driver is a student and velocity too high (S ∩ V)       5          0.05

Do students speed more frequently than the average person, or than non-students? Answer: the conditional probability

P(V|S) = |driver is a student and velocity too high| / |driver is a student| = 5/30 = 1/6 ≈ 0.17.
Definition: For two events A and B, the probability of A under the condition B (conditional probability) is defined by

P(A|B) = P(A ∧ B) / P(B).

P(A|B) is the probability of A when we regard the event B only, i.e.

P(A|B) = |A ∧ B| / |B|.

Proof:

P(A|B) = P(A ∧ B) / P(B) = (|A ∧ B| / |Ω|) / (|B| / |Ω|) = |A ∧ B| / |B|.
Definition: If for two events A and B

P(A|B) = P(A),

then these events are called independent.

Theorem: For independent events A and B, it follows from the definition that

P(A ∧ B) = P(A) · P(B).

Proof?

Example: The probability of two sixes is 1/36 if the dice are independent, because

P(W1 = 6 ∧ W2 = 6) = P(W1 = 6) · P(W2 = 6) = 1/6 · 1/6 = 1/36.

If die 2 always falls the same as die 1, then

P(W1 = 6 ∧ W2 = 6) = 1/6.
Chain rule

Product rule: P(A ∧ B) = P(A|B) P(B)

Chain rule:

P(X1, ..., Xn) = P(Xn|X1, ..., Xn−1) · P(X1, ..., Xn−1)
               = P(Xn|X1, ..., Xn−1) · P(Xn−1|X1, ..., Xn−2) · P(X1, ..., Xn−2)
               = P(Xn|X1, ..., Xn−1) · P(Xn−1|X1, ..., Xn−2) · ... · P(X2|X1) · P(X1)
               = ∏_{i=1}^{n} P(Xi|X1, ..., Xi−1)    (12)
Marginalization

For binary variables A and B:

P(A) = P((A ∧ B) ∨ (A ∧ ¬B)) = P(A ∧ B) + P(A ∧ ¬B).

In general:

P(X1 = x1, ..., Xd−1 = xd−1) = Σ_{xd} P(X1 = x1, ..., Xd−1 = xd−1, Xd = xd)

- The resulting distribution P(X1, ..., Xd−1) is called the marginal distribution.
- It corresponds to the projection of a cube onto one edge (margin).
Example:
- Leuko: leukocyte value higher than 10000
- App: patient has appendicitis (appendix inflammation)

P(App, Leuko)   App     ¬App    total
Leuko           0.23    0.31    0.54
¬Leuko          0.05    0.41    0.46
total           0.28    0.72    1

For example, it holds:

P(Leuko) = P(App, Leuko) + P(¬App, Leuko) = 0.54

P(Leuko|App) = P(Leuko, App) / P(App) = 0.23/0.28 = 0.82
Bayes' Theorem

P(A|B) = P(A ∧ B) / P(B)   as well as   P(B|A) = P(A ∧ B) / P(A)

Bayes' theorem:

P(A|B) = P(B|A) · P(A) / P(B)    (13)

Appendicitis example:
- P(App|Leuko) would be much more interesting for the diagnosis of appendicitis, but it is not published!
- Why is P(Leuko|App) published, but P(App|Leuko) not?

P(App|Leuko) = P(Leuko|App) · P(App) / P(Leuko) = 0.82 · 0.28 / 0.54 = 0.43    (14)
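The appendicitis numbers can be checked directly with a few lines of Python (our own sketch of the computation, not LEXMED code):

# Joint distribution P(App, Leuko) from the table above
P = {(True, True): 0.23,  (True, False): 0.05,
     (False, True): 0.31, (False, False): 0.41}

p_app   = P[(True, True)] + P[(True, False)]    # marginalization: 0.28
p_leuko = P[(True, True)] + P[(False, True)]    # marginalization: 0.54

p_leuko_given_app = P[(True, True)] / p_app     # 0.23/0.28 = 0.82
# Bayes' theorem:
p_app_given_leuko = p_leuko_given_app * p_app / p_leuko
print(round(p_leuko_given_app, 2), round(p_app_given_leuko, 2))   # 0.82 0.43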
Bayes' Theorem: Example

- very reliable burglar alarm
- reports any burglary with 99% certainty
- thus with high certainty: if alarm, then burglary!
- No! With P(A|E) = 0.99, P(A) = 0.1, P(E) = 0.001:

P(E|A) = P(A|E) P(E) / P(A) = 0.99 · 0.001 / 0.1 = 0.01
The Maximum Entropy Method

- a calculus for reasoning under uncertainty
- often there is too little knowledge to solve the necessary equations
- idea of E.T. Jaynes (physicist): maximize the entropy of the sought probability distribution! [Jay57, Jay03]
- [Che83, Nil86, Kan89, KK92]
- application to the LEXMED project

An inference rule for probabilities

Modus ponens:

A, A → B
--------
    B

Generalization to probability rules:

P(A) = α, P(B|A) = β
--------------------
      P(B) = ?

Given: the two probability values α, β. Sought: P(B).

Marginalization:

P(B) = P(A, B) + P(¬A, B) = P(B|A) · P(A) + P(B|¬A) · P(¬A).

With classical probability theory alone, only P(B) ≥ P(B|A) · P(A) can be concluded.
Distribution: P(A, B) = (P(A, B), P(A, ¬B), P(¬A, B), P(¬A, ¬B))

Abbreviations:

p1 = P(A, B),  p2 = P(A, ¬B),  p3 = P(¬A, B),  p4 = P(¬A, ¬B)

- These four parameters (unknowns) define the distribution.
- From them, any probability for A and B can be calculated.
- Four equations are required.

Normalization condition: p1 + p2 + p3 + p4 = 1

From P(A, B) = P(B|A) · P(A) = αβ and P(A) = P(A, B) + P(A, ¬B) we obtain the system of equations:

p1 = αβ                      (15)
p1 + p2 = α                  (16)
p1 + p2 + p3 + p4 = 1        (17)

(15) in (16):  p2 = α − αβ = α(1 − β)    (18)
(16) in (17):  p3 + p4 = 1 − α           (19)

One equation is missing!
Solving an optimization problem

Sought: the distribution p = (p3, p4) which maximizes the entropy

H(p) = − Σ_{i} pi ln pi = −p3 ln p3 − p4 ln p4

under the constraint p3 + p4 = 1 − α (Equation (19)).

Why should the entropy function be maximized?
- The entropy measures the uncertainty of a distribution.
- The negative entropy is a measure of the information content of the distribution.
- Maximizing the entropy minimizes the information content of the distribution.
[Figure: the two-dimensional entropy function with the constraint p3 + p4 = 1]
Maximizing the entropy

- constraint: p3 + p4 − 1 + α = 0
- method of Lagrange multipliers [BHW89]

Lagrange function:

L = −p3 ln p3 − p4 ln p4 + λ(p3 + p4 − 1 + α)

∂L/∂p3 = −ln p3 − 1 + λ = 0
∂L/∂p4 = −ln p4 − 1 + λ = 0

⇒ p3 = p4 = (1 − α)/2.

Hence

P(B) = P(A, B) + P(¬A, B) = p1 + p3 = αβ + (1 − α)/2 = α(β − 1/2) + 1/2.

Substituting α and β:

P(B) = P(A) · (P(B|A) − 1/2) + 1/2.
[Figure: P(B) = P(A)(P(B|A) − 1/2) + 1/2, plotted over P(A) for P(B|A) = 0, 0.1, ..., 1]
Theorem: Let there be a consistent (16) set of linear probabilistic equations. Then there exists a unique maximum of the entropy function with the given equations as constraints. The MaxEnt distribution thereby defined has minimum information content under the constraints.

- There is no other distribution which satisfies the constraints while having a lower information content (i.e. higher entropy).
- A calculus which leads to distributions with lower entropy adds information ad hoc, which is not justified.

16 A set of probabilistic equations is called consistent if there is at least one solution, that is, one distribution which satisfies all equations.
- p3 and p4 always occur symmetrically
- therefore p3 = p4 (indifference)
- in general:

Definition: If an arbitrary exchange of two or more variables in the Lagrange equations results in equivalent equations, these variables are called indifferent.

Theorem: If a set of variables {pi1, ..., pik} is indifferent, then the maximum of the entropy under the given constraints is at the point where pi1 = pi2 = ... = pik.
Maximum Entropy Without Explicit Constraints

- no knowledge given
- no constraints besides the normalization condition p1 + p2 + ... + pn = 1
- all variables are indifferent
- hence p1 = p2 = ... = pn = 1/n (Exercise)
- all worlds are equally probable
Special case: two variables A and B

P(A, B) = P(A, ¬B) = P(¬A, B) = P(¬A, ¬B) = 1/4,

hence P(A) = P(B) = 1/2 and P(B|A) = 1/2.

[Figure: the entropy function for this case]
A further example

Given: P(B|A) = β. From P(A, B) = P(B|A) P(A) = β P(A) we get p1 = β(p1 + p2), i.e. the constraints

(β − 1) p1 + β p2 = 0
p1 + p2 + p3 + p4 − 1 = 0.

- No symbolic solution!
- The Lagrange equations are solved numerically.

[Figure: p1, p2, p3, p4 depending on β = P(B|A)]
Conditional probability versus material implication

A   B   A ⇒ B   P(A)   P(B)   P(B|A)
w   w     w       1      1       1
w   f     f       1      0       0
f   w     w       0      1     undefined
f   f     w       0      0     undefined

Question: what value does P(B|A) take if only P(A) = α and P(B) = γ are given?
Conditional probability versus material implication

With p1 = P(A, B), p2 = P(A, ¬B), p3 = P(¬A, B), p4 = P(¬A, ¬B), the constraints are

p1 + p2 = α              (20)
p1 + p3 = γ              (21)
p1 + p2 + p3 + p4 = 1    (22)

Maximizing the entropy (see Exercise) yields

p1 = αγ,  p2 = α(1 − γ),  p3 = γ(1 − α),  p4 = (1 − α)(1 − γ).

- From p1 = αγ it follows that P(A, B) = P(A) · P(B): independence of A and B.
- From the definition P(B|A) = P(A, B)/P(A) it follows for P(A) ≠ 0 that P(B|A) = P(B) = γ.
- For P(A) = 0, P(B|A) stays undefined.

A   B   A ⇒ B   P(A)   P(B)   P(B|A)
w   w     w       α      γ       γ
f   w     w       0      γ     undefined
MaxEnt systems

- Often the MaxEnt optimization has no symbolic solution.
- Therefore: numerical entropy maximization.
- SPIRIT: FernUniversität Hagen (distance teaching university) [RM96, BK00]
- PIT: TU Munich [Sch96, ES99, SE00]
- PIT uses Sequential Quadratic Programming (SQP) to find an extremum of the entropy function numerically.
- As input, PIT expects a file with the constraints:

var A{t,f}, B{t,f};
P([A=t]) = 0.6;
P([B=t] | [A=t]) = 0.3;
QP([B=t]);
QP([B=t] | [A=t]);

- QP([B=t]) is a query.
- Web front end on www.pit-systems.de
- Result:

Nr  Truthvalue   Probability  Query
1   UNSPECIFIED  3.800e-01    QP([B=t]);
2   UNSPECIFIED  3.000e-01    QP([A=t]-|> [B=t]);

- i.e. P(B) = 0.38 and P(B|A) = 0.3.
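The same MaxEnt optimization can be reproduced numerically, for instance with SciPy's SLSQP optimizer instead of PIT. A minimal sketch under the constraints P(A) = 0.6 and P(B|A) = 0.3 (variable names are our own):

import numpy as np
from scipy.optimize import minimize

# Unknowns: p = (P(A,B), P(A,~B), P(~A,B), P(~A,~B))
def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)               # avoid log(0)
    return float(np.sum(p * np.log(p)))      # minimizing -H(p) maximizes H(p)

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},       # normalization
    {"type": "eq", "fun": lambda p: p[0] + p[1] - 0.6},   # P(A) = 0.6
    {"type": "eq", "fun": lambda p: p[0] - 0.3 * 0.6},    # P(A,B) = P(B|A) P(A)
]
res = minimize(neg_entropy, x0=np.full(4, 0.25),
               bounds=[(0.0, 1.0)] * 4, constraints=constraints)
p1, p2, p3, p4 = res.x
print(round(p1 + p3, 3))   # P(B) = p1 + p3 -> 0.38, as computed by PIT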
The Tweety example

P(bird|penguin) = 1             penguins are birds
P(flies|bird) ∈ [0.95, 1]       (almost all) birds can fly
P(flies|penguin) = 0            penguins cannot fly

PIT input file:

var penguin{yes,no}, bird{yes,no}, flies{yes,no};
P([bird=yes] | [penguin=yes]) = 1;
P([flies=yes] | [bird=yes]) IN [0.95,1];
P([flies=yes] | [penguin=yes]) = 0;
QP([flies=yes] | [penguin=yes]);

Answer:

Nr  Truthvalue   Probability  Query
1   UNSPECIFIED  0.000e+00    QP([penguin=yes]-|> [flies=yes]);
MaxEnt and non-monotonicity

- Probability intervals are often very helpful.
- The second rule in the sense of "normally birds fly": P(flies|bird) ∈ (0.5, 1]
- MaxEnt enables non-monotonic inference.
- MaxEnt is also successful on challenging benchmarks for non-monotonic inference [Sch96].
- Application of MaxEnt within the medical expert system LEXMED.
LEXMED, a medical expert system for the diagnosis of appendicitis

Manfred Schramm, Walter Rampf, Wolfgang Ertel
Ravensburg-Weingarten University of Applied Sciences, Hospital 14 Nothelfer Weingarten, Technical University Munich [SE00, Le99]

LEXMED = medical expert system capable of learning.

The project was funded by the German state of Baden-Württemberg, the AOK Baden-Württemberg, the Ravensburg-Weingarten University of Applied Sciences and the hospital 14 Nothelfer in Weingarten.

[Screenshot: LEXMED query form]

[Screenshot: LEXMED answer]
Result of the PIT diagnosis:

Diagnosis     App. inflamed   App. perforated   Negative   Other
Probability       0.70             0.17           0.06      0.07
Diagnosis of appendicitis with formal methods

- Appendicitis is the most frequent cause of acute stomach pain [Dom91].
- The diagnosis is difficult [Ohm95]: approx. 20% of the surgically removed appendixes show no clinical abnormalities.
- There are also cases in which an inflamed appendix is not recognized.
- In 1972, de Dombal (Great Britain) developed an expert system for the diagnosis of acute stomach pain [Dom72, Ohm94, Ohm95].
Linear scores

With n symptoms S1, ..., Sn, a score can formally be defined as

Diagnosis = Appendicitis if w1 S1 + ... + wn Sn > Θ, else Negative.

- Scores are too weak for modelling complex relations.
- Score systems cannot take contexts into account; e.g. they cannot distinguish between the leukocyte values of elderly and middle-aged people.
- They place high requirements on the databases (representativeness).
Hybrid probabilistic knowledge base

Query to the expert system: what is the probability of an inflamed or perforated appendix if the patient is a 23-year-old man with pain in the right lower abdomen and a leukocyte value of 13000?

P(Bef4 = inflamed ∨ Bef4 = perforated | Sex = male ∧ Age ∈ 21-25 ∧ Leuko ∈ 12k-15k)
Symptoms

Symptom                     Values                                            #    Short
Gender                      male, female                                      2    Sex2
Age                         0-5, 6-10, 11-15, 16-20, 21-25, 26-35,
                            36-45, 46-55, 56-65, 65-                         10    Age10
Pain 1st quadrant           yes, no                                           2    P1Q2
Pain 2nd quadrant           yes, no                                           2    P2Q2
Pain 3rd quadrant           yes, no                                           2    P3Q2
Pain 4th quadrant           yes, no                                           2    P4Q2
Guarding                    local, global, none                               3    Gua3
Rebound tenderness          yes, no                                           2    Reb2
Pain on tapping             yes, no                                           2    Tapp2
Rectal pain                 yes, no                                           2    RecP2
Bowel sounds                weak, normal, increased, none                     4    BowS4
Abnormal ultrasound         yes, no                                           2    Sono2
Abnormal urine sediment     yes, no                                           2    Urin2
Temperature (rectal)        -37.3, 37.4-37.6, 37.7-38.0, 38.1-38.4,
                            38.5-38.9, 39.0-                                  6    TRec6
Leukocytes                  0-6k, 6k-8k, 8k-10k, 10k-12k, 12k-15k,
                            15k-20k, 20k-                                     7    Leuko7
Medical findings            inflamed, perforated, negative, other             4    Bef4
Knowledge bases

[Figure: the LEXMED knowledge sources. A database of 15000 patients from Baden-Württemberg (1995) is turned almost automatically, by rule induction, into a rule base of 500 rules such as P(Leuko > 100 | App = positive) = 0.7; expert knowledge (Dr. Rampf, Dr. Hontschik) is entered manually. MaxEnt completes the rule base automatically to a complete probability distribution, which the physician queries for a diagnosis.]
System architecture

[Figure: LEXMED system architecture. The runtime system connects the user interface (query/symptoms in, diagnosis out), the patient management with its doctor-specific (private) patient database, a cost matrix with weighting, and the PIT runtime system, which answers the probability queries. In the knowledge modelling part, the probability distribution is built by MaxEnt completion of a rule set obtained from experts, literature, and rule induction on the database.]
Probability distribution

Size of the distribution:

2^10 · 10 · 3 · 4 · 6 · 7 · 4 = 20 643 840,

i.e. 20 643 839 independent values.

- Any rule set with fewer than 20 643 839 probability values does not describe the event space completely.
- But a complete distribution is required.
- A human expert cannot deliver 20 643 839 values!
Functionality of LEXMED

Probability statements (17):

P(Leuko > 20000 | Bef4 = inflamed) = 0.09

17 Instead of single numerical values, intervals may also be used (e.g. [0.06, 0.12]).

The dependency graph

[Figure: dependency graph of the LEXMED variables]
Learning the rules by statistical induction

- estimating the rule probabilities
- structure of the dependency graph = structure of the learned rules (as in a Bayesian network)
- a priori rules, e.g. Equation (23)
- rules with a single condition, e.g. Equation (24)
- rules with the diagnosis and two symptoms, e.g. Equation (25)

P(Bef4 = inflamed) = 0.40                                (23)
P(Sono = yes | Bef4 = inflamed) = 0.43                   (24)
P(S4Q = yes | Bef4 = inflamed ∧ S2Q = yes) = 0.61        (25)

Some LEXMED rules with probability intervals (* stands for ∧):

P([Leuco7=0-6k]   | [Diag4=negativ] * [Age10=16-20]) = [0.132,0.156];
P([Leuco7=6-8k]   | [Diag4=negativ] * [Age10=16-20]) = [0.257,0.281];
P([Leuco7=8-10k]  | [Diag4=negativ] * [Age10=16-20]) = [0.250,0.274];
P([Leuco7=10-12k] | [Diag4=negativ] * [Age10=16-20]) = [0.159,0.183];
P([Leuco7=12-15k] | [Diag4=negativ] * [Age10=16-20]) = [0.087,0.112];
P([Leuco7=15-20k] | [Diag4=negativ] * [Age10=16-20]) = [0.032,0.056];
P([Leuco7=20k-]   | [Diag4=negativ] * [Age10=16-20]) = [0.000,0.023];
P([Leuco7=0-6k]   | [Diag4=negativ] * [Age10=21-25]) = [0.132,0.172];
P([Leuco7=6-8k]   | [Diag4=negativ] * [Age10=21-25]) = [0.227,0.266];
P([Leuco7=8-10k]  | [Diag4=negativ] * [Age10=21-25]) = [0.211,0.250];
P([Leuco7=10-12k] | [Diag4=negativ] * [Age10=21-25]) = [0.166,0.205];
P([Leuco7=12-15k] | [Diag4=negativ] * [Age10=21-25]) = [0.081,0.120];
P([Leuco7=15-20k] | [Diag4=negativ] * [Age10=21-25]) = [0.041,0.081];
P([Leuco7=20k-]   | [Diag4=negativ] * [Age10=21-25]) = [0.004,0.043];

Estimating a probability by counting frequencies:

P(Sono = yes | Bef4 = inflamed) = |Bef4 = inflamed ∧ Sono = yes| / |Bef4 = inflamed| = 0.43.
Risk management using the cost matrix

Result of the PIT diagnosis:

Diagnosis     App. inflamed   App. perforated   Negative   Other
Probability       0.24             0.16           0.55      0.05

What to do?
The cost matrix

Probabilities of the various diagnoses and costs of the therapies:

Therapy                inflamed   perforated   negative    other    avg. cost
                         0.25        0.15        0.55       0.05
Operation                   0         500        5800       6000       3565
Emergency operation       500           0        6300       6500       3915
Ambulant observ.        12000      150000           0      16500      26325
Other                    3000        5000        1300          0       2215
Stationary observ.       3500        7000         400        600       2175

- Optimal decisions have (additional) cost 0.
- Goal: the therapy with the smallest average cost of error.
- Probability vector: (0.25, 0.15, 0.55, 0.05).
- Expected cost of wrong decisions, e.g. for the first line (operation): matrix row · vector = 0.25 · 0 + 0.15 · 500 + 0.55 · 5800 + 0.05 · 6000 = 3565.
- A cost-oriented agent chooses the therapy with the minimal expected cost, here stationary observation.
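The expected costs are a single matrix-vector product. A minimal Python sketch of the cost-oriented decision (our own illustration of the calculation above):

import numpy as np

p = np.array([0.25, 0.15, 0.55, 0.05])  # inflamed, perforated, negative, other
therapies = ["Operation", "Emergency operation", "Ambulant observ.",
             "Other", "Stationary observ."]
cost = np.array([[    0,    500, 5800,  6000],
                 [  500,      0, 6300,  6500],
                 [12000, 150000,    0, 16500],
                 [ 3000,   5000, 1300,     0],
                 [ 3500,   7000,  400,   600]])

expected = cost @ p                      # average cost of error per therapy
for t, c in zip(therapies, expected):
    print(f"{t:22s} {c:8.0f}")           # 3565, 3915, 26325, 2215, 2175
print("decision:", therapies[int(np.argmin(expected))])  # stationary observ.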
Cost matrix in the binary case

- diagnoses: Appendicitis and NSAP, with P(Appendicitis) = p1 and P(NSAP) = p2
- therapies: operation, ambulant observation (send the patient home)
- cost matrix (k2: false positive, k1: false negative):

( 0   k2 )
( k1   0 )

( 0   k2 ) (p1)   (k2 p2)
( k1   0 ) (p2) = (k1 p1)

Multiplying the cost vector by 1/k1 leaves the decision unchanged:

(1/k1) (k2 p2, k1 p1) = ((k2/k1) p2, p1),

i.e. the same result is obtained with the matrix

( 0   k )
( 1   0 )    where k = k2/k1.

Risk management:
- the ratio k is the working point of the diagnosis system
- k = 0: the system is configured extremely risky
- k → ∞: all patients are operated on
Performance measures

Sensitivity = P(classified positive | positive) = |positive and classified positive| / |positive|

Specificity = P(classified negative | negative) = |negative and classified negative| / |negative|
Performance / ROC curve

[Figure: ROC curves (sensitivity over 1 − specificity) for LEXMED, the Ohmann score, a score trained with the LEXMED data, RProp trained with the LEXMED data, and random decisions]
Application field and experience

- use in diagnosis
- quality assurance: comparing the diagnosis quality of hospitals with that of expert systems
- in use in the 14 Nothelfer hospital in Weingarten since 1999: www.lexmed.de
- the diagnosis quality is comparable to that of an experienced surgeon
- commercial marketing is very difficult
- wrong time? patients wish for personal care!
- since de Dombal 1972, 39 years have passed; will it take another 39 years?
Reasoning with Bayesian Networks

- d variables X1, ..., Xd with n values each
- the probability distribution has n^d − 1 values
- in practice the distribution contains many redundancies

Independent variables:

P(X1, ..., Xd) = P(X1) · P(X2) · ... · P(Xd)

Conditional probabilities then become trivial (18):

P(A|B) = P(A, B)/P(B) = P(A) P(B)/P(B) = P(A)

- Example from [RN03]

18 Naive Bayes method! (see Sec. )
The alarm example

Example (J. Pearl)

Knowledge base:

P(J|Al) = 0.90     P(M|Al) = 0.70
P(J|¬Al) = 0.05    P(M|¬Al) = 0.01

P(Al|Bur, Ear) = 0.95
P(Al|Bur, ¬Ear) = 0.94
P(Al|¬Bur, Ear) = 0.29
P(Al|¬Bur, ¬Ear) = 0.001

A priori probabilities: P(Bur) = 0.001, P(Ear) = 0.002.

Queries: P(Bur|J ∨ M), P(J|Bur), P(M|Bur)
Graphical representation of the knowledge as a Bayesian network:

[Figure: the alarm network. Burglary (P(Bur) = 0.001) and Earthquake (P(Ear) = 0.002) are parents of Alarm; John and Mary are children of Alarm.]

CPTs:

Bur  Ear | P(Al)        Al | P(J)        Al | P(M)
 w    w  | 0.95          w | 0.90         w | 0.70
 w    f  | 0.94          f | 0.05         f | 0.01
 f    w  | 0.29
 f    f  | 0.001
Conditional independence

Definition: Two variables A and B are called conditionally independent given C if

P(A, B|C) = P(A|C) · P(B|C).

Examples:

P(J, M|Al) = P(J|Al) · P(M|Al)
P(J, Bur|Al) = P(J|Al) · P(Bur|Al)

Theorem: The following equations are pairwise equivalent, i.e. each individual equation describes the conditional independence of the variables A and B given C:

P(A, B|C) = P(A|C) · P(B|C)    (26)
P(A|B, C) = P(A|C)             (27)
P(B|A, C) = P(B|C)             (28)

Proof:

P(A, B, C) = P(A, B|C) P(C) = P(A|C) P(B|C) P(C)
P(A, B, C) = P(A|B, C) P(B|C) P(C).

Thus P(A|B, C) = P(A|C).
Practical application

P(J|Bur) = P(J, Bur)/P(Bur) = [P(J, Bur, Al) + P(J, Bur, ¬Al)]/P(Bur)    (29)

and, since J is conditionally independent of Bur given Al,

P(J, Bur, Al) = P(J|Bur, Al) P(Al|Bur) P(Bur) = P(J|Al) P(Al|Bur) P(Bur).    (30)

Hence

P(J|Bur) = [P(J|Al) P(Al|Bur) P(Bur) + P(J|¬Al) P(¬Al|Bur) P(Bur)]/P(Bur)
         = P(J|Al) P(Al|Bur) + P(J|¬Al) P(¬Al|Bur).    (31)

P(Al|Bur) and P(¬Al|Bur) are still missing!

P(Al|Bur) = P(Al, Bur)/P(Bur) = [P(Al, Bur, Ear) + P(Al, Bur, ¬Ear)]/P(Bur)
          = [P(Al|Bur, Ear) P(Bur) P(Ear) + P(Al|Bur, ¬Ear) P(Bur) P(¬Ear)]/P(Bur)
          = P(Al|Bur, Ear) P(Ear) + P(Al|Bur, ¬Ear) P(¬Ear)
          = 0.95 · 0.002 + 0.94 · 0.998 = 0.94

P(J|Bur) = 0.9 · 0.94 + 0.05 · 0.06 = 0.849.

Analogously: P(M|Bur) = 0.659.

P(J, M|Bur) = P(J, M|Al) P(Al|Bur) + P(J, M|¬Al) P(¬Al|Bur)
            = P(J|Al) P(M|Al) P(Al|Bur) + P(J|¬Al) P(M|¬Al) P(¬Al|Bur)
            = 0.9 · 0.7 · 0.94 + 0.05 · 0.01 · 0.06 = 0.5922

P(J ∨ M|Bur) = P(¬(¬J, ¬M)|Bur) = 1 − P(¬J, ¬M|Bur)
             = 1 − [P(¬J|Al) P(¬M|Al) P(Al|Bur) + P(¬J|¬Al) P(¬M|¬Al) P(¬Al|Bur)]
             = 1 − [0.1 · 0.3 · 0.94 + 0.95 · 0.99 · 0.06] = 1 − 0.085 = 0.915

P(Bur|J) = P(J|Bur) P(Bur)/P(J) = 0.849 · 0.001/0.052 = 0.016

P(Bur|M) = 0.056 (see Exercise), and P(Bur|J, M) = 0.284.

The general pattern behind P(J|Bur) = P(J|Al) P(Al|Bur) + P(J|¬Al) P(¬Al|Bur) is conditioning:

P(A|B) = Σ_c P(A|B, C = c) P(C = c|B).
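All of these values can be verified by brute-force enumeration over the five binary variables, i.e. by summing the chain-rule factorization of the network. The sketch below is our own Python code, not PIT:

P_bur, P_ear = 0.001, 0.002
P_al = {(True, True): 0.95, (True, False): 0.94,
        (False, True): 0.29, (False, False): 0.001}
P_j = {True: 0.90, False: 0.05}          # P(J | Al)
P_m = {True: 0.70, False: 0.01}          # P(M | Al)

def joint(bur, ear, al, j, m):
    # P(Bur,Ear,Al,J,M) = P(Bur) P(Ear) P(Al|Bur,Ear) P(J|Al) P(M|Al)
    p = (P_bur if bur else 1 - P_bur) * (P_ear if ear else 1 - P_ear)
    p *= P_al[(bur, ear)] if al else 1 - P_al[(bur, ear)]
    p *= P_j[al] if j else 1 - P_j[al]
    p *= P_m[al] if m else 1 - P_m[al]
    return p

tf = (True, False)
# P(Bur | J, M): sum out Ear and Al in numerator and denominator
num = sum(joint(True, e, a, True, True) for e in tf for a in tf)
den = sum(joint(b, e, a, True, True) for b in tf for e in tf for a in tf)
print(round(num / den, 4))   # -> 0.2841, as computed by PIT below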
Software for Bayesian networks: PIT

var Alarm{t,f}, Burglary{t,f}, Earthquake{t,f}, John{t,f}, Mary{t,f};
P([Earthquake=t]) = 0.002;
P([Burglary=t]) = 0.001;
P([Alarm=t] | [Burglary=t] AND [Earthquake=t]) = 0.95;
P([Alarm=t] | [Burglary=t] AND [Earthquake=f]) = 0.94;
P([Alarm=t] | [Burglary=f] AND [Earthquake=t]) = 0.29;
P([Alarm=t] | [Burglary=f] AND [Earthquake=f]) = 0.001;
P([John=t] | [Alarm=t]) = 0.90;
P([John=t] | [Alarm=f]) = 0.05;
P([Mary=t] | [Alarm=t]) = 0.70;
P([Mary=t] | [Alarm=f]) = 0.01;
QP([Burglary=t] | [John=t] AND [Mary=t]);

Response: P([Burglary=t] | [John=t] AND [Mary=t]) = 0.2841.
Bayesian networks and MaxEnt

- Given CPTs or equivalent rules as input, the MaxEnt principle implies the same conditional independences, and thus also the same answers, as a Bayesian network [Sch96].
- Thus Bayesian networks are a special case of MaxEnt.

Software for Bayesian networks: JavaBayes

Software for Bayesian networks: Hugin

- The professional tool Hugin is more powerful.
- Continuous variables are possible.
- Hugin can also learn Bayesian networks, i.e. generate the network fully automatically from statistical data.
Development of Bayesian networks

- For the variables v1, ..., vn with |v1|, ..., |vn| different values each, the full distribution has

∏_{i=1}^{n} |vi| − 1

independent entries. Alarm example: 2^5 − 1 = 31 independent entries.
- For a node vi with ki parent nodes ei1, ..., eiki, the associated CPT has

(|vi| − 1) ∏_{j=1}^{ki} |eij|

entries.
- All CPTs of the network together have

Σ_{i=1}^{n} (|vi| − 1) ∏_{j=1}^{ki} |eij|    (32)

entries (19). Alarm example: 2 + 2 + 4 + 1 + 1 = 10.

19 For a node without ancestors we substitute the value 1, because its CPT contains, with the a priori probability, exactly one value.

Special case:
- n variables with an equal number b of values each
- each node has k parent nodes
- all CPTs together have n(b − 1) b^k entries
- the complete distribution contains b^n − 1 entries
- connections are local; the network becomes modularized
LEXMED as a Bayesian network

[Figure: the LEXMED network with the nodes Leuko7, TRek6, Abw3, PathU2, RektS2, Alt10, Bef4, S1Q2, S2Q2, S3Q2, S4Q2, Losl2, Sono2, Ersch2, Darmg4, Sex2]

- size of the full distribution: 20 643 839 values
- size of the Bayesian network: 521 values

In the order (Leuko, TRek, Abw, Alt, Losl, Sono, Ersch, Darmg, Sex, S4Q, S1Q, S2Q, RektS, PathU, S3Q, Bef4) we calculate

Σ_{i=1}^{n} (|vi| − 1) ∏_{j=1}^{ki} |eij|
= 6·6·4 + 5·4 + 2·4 + 9·7·4 + 1·3·4 + 1·4 + 1·2·4 + 3·3·4 + 1·4
  + 1·4·2 + 1·4·2 + 1·4 + 1·4 + 1·4 + 1·4 + 1 = 521.
Causality and network structure

Construction of a Bayesian network:
1. Design the network structure (usually performed manually).
2. Enter the probabilities in the CPTs (usually automated).

- causes: burglary and earthquake
- symptoms: John and Mary
- Alarm: hidden variable
- considering causality: go from cause to effect

[Figure: stepwise construction of the alarm network considering causality. Compare: Exercise]
Semantics of Bayesian networks

[Figure: three small networks over A, B, C. There is no edge between A and B if they are independent (left) or conditionally independent given C (middle, right).]
Requirements:
- the Bayesian network has no cycles
- no variable has a lower index than any of its predecessor nodes (parents are numbered before their children)

Then it holds that

P(Xn|X1, ..., Xn−1) = P(Xn|Parents(Xn)).

Theorem: A node in a Bayesian network is conditionally independent of all non-successor nodes, given its parents.
[Figure: if the parent nodes E1 and E2 are given, then all non-successor nodes B1, ..., B8 are independent of A; N1, ..., N4 are successors of A.]
Chain rule for Bayesian networks:

P(X1, ..., Xn) = ∏_{i=1}^{n} P(Xi|X1, ..., Xi−1) = ∏_{i=1}^{n} P(Xi|Parents(Xi))

Thus Equation (30) holds:

P(J, Bur, Al) = P(J|Al) P(Al|Bur) P(Bur).
The most important concepts and foundations of Bayesian networks are now known, and we can summarize them [Jen01]:

Definition: A Bayesian network is defined by:
- a set of variables (nodes) and a set of directed edges between these variables
- each variable has finitely many possible values
- the variables together with the edges form a directed acyclic graph (DAG); a DAG is a graph without cycles, that is, without paths of the form (A, ..., A)
- for every variable A, the CPT (the table of conditional probabilities P(A|Parents(A))) is given

Two variables A and B are called conditionally independent given C if P(A, B|C) = P(A|C) · P(B|C) or, equivalently, if P(A|B, C) = P(A|C). Besides the basic computation rules for probabilities, the following rules also hold:

Bayes' theorem: P(A|B) = P(B|A) · P(A) / P(B)

Marginalization: P(B) = P(A, B) + P(¬A, B) = P(B|A) · P(A) + P(B|¬A) · P(¬A)

Conditioning: P(A|B) = Σ_c P(A|B, C = c) P(C = c|B)

A variable in a Bayesian network is conditionally independent of all non-successor variables given its parent variables. If X1, ..., Xn−1 are no successors of Xn, we have P(Xn|X1, ..., Xn−1) = P(Xn|Parents(Xn)). This condition must be honored during the construction of a network.

During the construction of a Bayesian network, the variables should be ordered according to causality: first the causes, then the hidden variables, and the diagnosis variables last.

See also: d-separation in [Pea88] and [Jen01].
Summary

- probabilistic logic for reasoning with uncertain knowledge
- the method of maximum entropy models non-monotonic reasoning
- Bayesian networks are a special case of MaxEnt
- Bayesian networks rely on independence assumptions
- in a Bayesian network, all CPTs must be filled completely
- with MaxEnt, arbitrary knowledge can be formulated, e.g. "I am pretty sure that A is true": P(A) ∈ [0.6, 1]
- inconsistency: P(A) = 0.7 and P(A) = 0.8
- PIT recognizes inconsistencies
- in some cases reasoning is possible anyway

Combination of MaxEnt and Bayesian networks:
1. Construct a network using the Bayesian methodology.
2. Missing values in the CPTs can be replaced by intervals or by other probabilistic formulas.
3. The MaxEnt system completes the rule set.
4. Example: LEXMED.

LEXMED:
- medical expert system LEXMED
- better than linear score systems
- scores are equivalent to the special case naive Bayes, that is, to the assumption that all symptoms are conditionally independent given the diagnosis (see Sec. , Exercise)
- LEXMED can learn knowledge from data given in a database (see Ch. )
Outlook

- nowadays, Bayesian inference is very important and well-developed
- handling continuous variables is possible under restrictions
- causality in Bayesian networks
- undirected Bayesian networks
- literature: [Pea88; Jen01; Whi96; DHS01]
- Association for Uncertainty in Artificial Intelligence (AUAI) (www.auai.org), UAI conference

Teil VIII

Machine Learning and Data Mining
- The Perceptron, a Linear Classifier
- The Nearest Neighbour Method
- Decision Tree Learning
- Learning of Bayesian Networks
- The Naive Bayes Classifier
Why machine learning?

Elaine Rich [Ric83]: "Artificial Intelligence is the study of how to make computers do things at which, at the moment, people are better", and: humans learn better than computers
⇒ machine learning is very important for AI.

Why machine learning?

- complexity of software development
- behaviour of an autonomous robot
- expert systems
- spam filters

Solution: hybrid software with programmed and learned components.

What is learning?

- learning the vocabulary of a foreign language?
- memorizing a poem?
- acquisition of mathematical skills?
- learning to ski?
Sorting device for apples

Features: size and color
Classification task (classifier): sort apples into the merchandise classes A and B.

Size [cm]                   8     8     6     3    ...
Color [0: green, 1: red]   0.1   0.3   0.9   0.8   ...
Merchandise class           B     A     A     B    ...

Training data for the apple sorting agent.
[Figure: apple sorting equipment and some apples classified into the merchandise classes A (+) and B (−) in the feature space spanned by color and size]

[Figure: a curve in feature space divides the two classes]

In practice: 30 or more features!

With n features, an (n − 1)-dimensional hyperplane has to be found in the n-dimensional feature space which divides the two classes as well as possible.
Terms

Classifier: maps a feature vector to a class value out of a fixed set of alternatives. Example: sorting apples.
Approximation: maps a feature vector to a real number. Example: forecasting share prices from given feature values.

The learning agent

[Figure: the apple sorting agent maps color and size to a merchandise class; it learns from classified apples]

[Figure: the learning agent in general maps features 1, ..., n to a class label or function value; it learns from training data]
Definition of machine learning

"Machine learning is the study of computer algorithms that improve automatically through experience." [Mit97]

Drawing on this, we define:

Definition: An agent is a learning agent if it improves its performance (measured by a suitable criterion) on new, unknown data over time (after it has seen many training examples).

Terms (apple sorting example):

Task: learning the mapping from size and color of an apple to a merchandise class.
Performance measure: the number of correctly classified apples.
Variable agent (more precisely, a class of agents): the learning algorithm determines the class of all learnable functions.
Training data (experience): the training data contain the knowledge which the learning algorithm is supposed to extract.
Test data: evaluate whether the trained agent generalizes well from the training data to new data.
Data mining

What is data mining?
- extract knowledge from training data
- make the knowledge understandable to humans
- example: decision tree induction
- knowledge management: analyzing customer preferences, e.g. in e-shops; selective advertising

Definition: The process of acquiring knowledge from data, as well as its representation and application, is called data mining.

Literature: I. Witten and E. Frank. Data Mining. Hanser Verlag München, 2001. The data mining program library WEKA, developed in Java by the authors: www.cs.waikato.ac.nz/~ml/weka.
Data analysis

LEXMED data, N = 473 patients:

Var. no.   Description                        Values
 1         age                                continuous
 2         sex (1 = male, 2 = female)         1, 2
 3         pain quadrant 1                    0, 1
 4         pain quadrant 2                    0, 1
 5         pain quadrant 3                    0, 1
 6         pain quadrant 4                    0, 1
 7         local muscular guarding            0, 1
 8         generalized muscular guarding      0, 1
 9         rebound tenderness                 0, 1
10         pain on tapping                    0, 1
11         pain during rectal examination     0, 1
12         axial temperature                  continuous
13         rectal temperature                 continuous
14         leucocytes                         continuous
15         diabetes mellitus                  0, 1
16         appendicitis                       0, 1

Two example patients:

x1 = (26, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 37.9, 38.8, 23100, 0, 1)
x2 = (17, 2, 0, 0, 1, 0, 1, 0, 1, 1, 0, 36.9, 37.4, 8100, 0, 0)

Mean:

x̄i := (1/N) Σ_{p=1}^{N} xi^p

Standard deviation:

si := sqrt( (1/(N − 1)) Σ_{p=1}^{N} (xi^p − x̄i)² )

Covariance:

σij = (1/(N − 1)) Σ_{p=1}^{N} (xi^p − x̄i)(xj^p − x̄j)

Correlation coefficient:

Kij = σij / (si · sj)

[Table: the 16 × 16 correlation matrix for the appendicitis variables measured in 473 cases]

Correlation matrix as density plot:
- left: Kij = −1 black, Kij = 1 white
- right: |Kij| = 0 black, |Kij| = 1 white

[Figure: the two density plots]
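With NumPy, mean, covariance and the correlation matrix follow directly from the formulas above. A sketch (our own code) on random placeholder data of the same shape; the real 473 × 16 LEXMED data are not reproduced here:

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((473, 16))                # placeholder for the patient data

mean = X.mean(axis=0)                                  # means x_bar_i
Sigma = (X - mean).T @ (X - mean) / (len(X) - 1)       # covariances sigma_ij
s = np.sqrt(np.diag(Sigma))                            # standard deviations s_i
K = Sigma / np.outer(s, s)                             # K_ij = sigma_ij/(s_i s_j)

assert np.allclose(K, np.corrcoef(X, rowvar=False))    # matches NumPy's builtin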
The Perceptron, a Linear Classifier

- linearly separable two-dimensional data set
- equation of the dividing line: a1 x1 + a2 x2 = 1

[Figure: a linearly separable data set; the dividing line intersects the axes at 1/a1 and 1/a2]

Linear separability

An (n − 1)-dimensional hyperplane in R^n is given by

Σ_{i=1}^{n} ai xi = θ,

thus we define:

Definition: Two sets M1 ⊂ R^n and M2 ⊂ R^n are called linearly separable if real numbers a1, ..., an, θ exist with

Σ_{i=1}^{n} ai xi > θ for all x ∈ M1   and   Σ_{i=1}^{n} ai xi ≤ θ for all x ∈ M2.

The value θ is called the threshold.
Linear separability

[Figure: AND is linearly separable, e.g. by the line x2 = −x1 + 3/2, but XOR is not (• = true, ◦ = false)]
Perceptron

[Figure: a perceptron with inputs x1, ..., xn, weights w1, ..., wn and output P(x)]

Definition: Let w = (w1, ..., wn) ∈ R^n be a weight vector and x ∈ R^n an input vector. A perceptron represents a function P: R^n → {0, 1} with

P(x) = 1 if w·x = Σ_{i=1}^{n} wi xi > 0, else 0.

- directed two-layer neural network
- the input variables xi are called features
- the separating hyperplane passes through the origin
- points x above the hyperplane Σ_{i=1}^{n} wi xi = 0 are classified as positive (P(x) = 1)
The learning rule

M+: set of positive training patterns, M−: set of negative training patterns (20):

PerceptronLearning(M+, M−)
  w = arbitrary vector of real numbers
  Repeat
    For all x ∈ M+
      If w·x ≤ 0 Then w = w + x
    For all x ∈ M−
      If w·x > 0 Then w = w − x
  Until all x ∈ M+ ∪ M− are correctly classified

20 M. Minsky and S. Papert. Perceptrons. MIT Press, Cambridge, MA, 1969, p. 164.
Why does the update rule point in the right direction? (21)

(w + x) · x = w·x + x²  ⇒  w·x eventually becomes positive.
(w − x) · x = w·x − x²  ⇒  w·x eventually becomes negative.

21 Caution! This is not a proof of convergence!

Example

- M+ = {(0, 1.8), (2, 0.6)}, M− = {(−1.2, 1.4), (0.4, −1)}
- initial weight vector: w = (1, 1)
- dividing line: w·x = x1 + x2 = 0

(−1.2, 1.4) is falsely classified, because (−1.2, 1.4) · (1, 1) = 0.2 > 0.
Thus w = (1, 1) − (−1.2, 1.4) = (2.2, −0.4).
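The learning rule is only a few lines of Python. The sketch below is our own translation of the pseudocode; max_epochs is an added safeguard for non-separable data. Its first update on the example data reproduces the step above, w = (2.2, −0.4).

import numpy as np

def perceptron_learning(M_plus, M_minus, w, max_epochs=1000):
    for _ in range(max_epochs):
        changed = False
        for x in M_plus:               # positive pattern on the wrong side
            if np.dot(w, x) <= 0:
                w = w + x
                changed = True
        for x in M_minus:              # negative pattern on the wrong side
            if np.dot(w, x) > 0:
                w = w - x
                changed = True
        if not changed:                # all patterns correctly classified
            return w
    return w                           # may not converge if not separable

M_plus  = [np.array([0, 1.8]), np.array([2, 0.6])]
M_minus = [np.array([-1.2, 1.4]), np.array([0.4, -1])]
w = perceptron_learning(M_plus, M_minus, w=np.array([1.0, 1.0]))
print(w)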
Convergence

Theorem: Let the classes M+ and M− be linearly separable by a hyperplane w·x = 0. Then PerceptronLearning converges for every initialization of the vector w. The perceptron P with the weight vector so calculated divides the classes M+ and M−, that is:

P(x) = 1 ⇔ x ∈ M+   and   P(x) = 0 ⇔ x ∈ M−.
Linear separability

- the perceptron cannot divide arbitrary linearly separable classes
- in R² the dividing line passes through the origin
- in R^n the hyperplane passes through the origin, because Σ_{i=1}^{n} wi xi = 0

Trick (bias unit): set xn := 1; the weight wn =: −θ then acts as a threshold, because

Σ_{i=1}^{n} wi xi = Σ_{i=1}^{n−1} wi xi − θ > 0  ⇔  Σ_{i=1}^{n−1} wi xi > θ.

- Add a component with constant value 1 to every training vector!
- The weight wn (threshold θ) is then learned too!

A perceptron Pθ: R^{n−1} → {0, 1} with

Pθ(x1, ..., xn−1) = 1 if Σ_{i=1}^{n−1} wi xi > θ, else 0    (33)

with an arbitrary threshold can thus be simulated by a perceptron P: R^n → {0, 1} with threshold 0.
Theorem: A function f: R^n → {0, 1} can be represented by a perceptron if and only if the two sets of positive and negative input vectors are linearly separable.

Proof: Compare Equation (33) with the definition of linear separability.
Example

[Figure: 5 × 5 binary patterns: positive training examples (M+), negative training examples (M−), and test patterns]

- the training data are learned in 4 iterations over all patterns
- evaluating the generalization capability: noisy patterns with a variable number of inverted (consecutive) bits

[Figure: relative correctness of the perceptron depending on the number of inverted bits in the training data]
Optimization and outlook

- slow convergence
- acceleration by normalization of the weight change vectors: w = w ± x/|x|
- better initialization of the vector w: w0 = Σ_{x ∈ M+} x − Σ_{x ∈ M−} x (see Exercise)
- multilayer networks, e.g. backpropagation, are more powerful (Sec. )
- backpropagation can divide classes which are not linearly separable
The Nearest Neighbour Method

- rote learning with generalization
- a physician memorizes cases; for a diagnosis he 1. remembers similar cases from the past and 2. makes the same or a similar diagnosis
- difficulties: the physician must have a good sense for similarity, and: is the found case similar enough?

What does similarity mean? The smaller their distance in feature space, the more similar two examples are.

[Figure: a two-dimensional data set; the point marked black is classified negative because its nearest neighbour is negative]

Distance

Distance d(x, y) between two points: the metric given by the Euclidean norm:

d(x, y) = |x − y| = sqrt( Σ_{i=1}^{n} (xi − yi)² ).

With weighted features:

dw(x, y) = sqrt( Σ_{i=1}^{n} wi (xi − yi)² ).
Algorithm

NearestNeighbour(M+, M−, s)
  t = argmin_{x ∈ M+ ∪ M−} {d(s, x)}
  If t ∈ M+ Then Return(+) Else Return(−)

Separating classes (virtually)

[Figure: Voronoi diagram (left) and the resulting dividing line (right)]
- The nearest neighbour method is more powerful than the perceptron.
- Arbitrarily complex dividing lines (hypersurfaces) are possible.

[Figure: a data set with an outlier; the point marked black is classified wrongly]

- How to deal with outliers?
- wrong adaptation to random errors (noise)
- overfitting!

Example

We apply NearestNeighbour to the pattern example above (Example ??), with the Hamming distance as metric (22).

22 The Hamming distance between two bit vectors is the number of differing bits.
[Figure: correctness of the nearest neighbour method depending on the number of inverted bits (grey: perceptron)]
k-nearest neighbour method

Majority decision among the k nearest neighbours:

k-NearestNeighbour(M+, M−, s)
  V = {k nearest neighbours in M+ ∪ M−}
  If |M+ ∩ V| > |M− ∩ V| Then Return(+)
  ElseIf |M+ ∩ V| < |M− ∩ V| Then Return(−)
  Else Return(Random(+, −))
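A direct Python translation (our own sketch, with the Euclidean distance and made-up example data):

import numpy as np

def k_nearest_neighbour(M_plus, M_minus, s, k):
    # Distances to all training points, labelled with their class
    data = [(np.linalg.norm(np.subtract(x, s)), "+") for x in M_plus] \
         + [(np.linalg.norm(np.subtract(x, s)), "-") for x in M_minus]
    data.sort(key=lambda pair: pair[0])
    votes = [label for _, label in data[:k]]            # the k nearest
    if votes.count("+") > votes.count("-"):
        return "+"
    if votes.count("+") < votes.count("-"):
        return "-"
    return np.random.default_rng().choice(["+", "-"])   # tie: random class

M_plus  = [(1, 1), (2, 1.5), (1.5, 2)]
M_minus = [(4, 4), (5, 3.5), (4.5, 5)]
print(k_nearest_neighbour(M_plus, M_minus, s=(2, 2), k=3))   # -> "+"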
More than two classes

- nearest neighbour classification for more than two classes: the class of the nearest neighbour
- k-nearest-neighbour method: the class with the majority among the k nearest neighbours
- use the k-nearest-neighbour method only with few classes, up to about 10
Example

An autonomous robot with simple sensors learns to avoid light.

- sensor signals sl from the left and sr from the right sensor, x = sr/sl
- based on x, the difference v = Ur − Ul of the voltages at the two motors has to be determined
- the agent has to find a mapping f that calculates the correct value v = f(x) for any x

[Figure: difference of motor voltages v over the ratio of sensor inputs x; training data with a learned classification (left) and an approximated function (right)]
Approximation methods

- polynomial interpolation, spline interpolation, least squares
- problematic in higher dimensions
- difficulty: in AI we need model-free approximation methods
- neural networks, support vector machines, Gaussian processes, ..., nearest neighbour

k-nn for approximation problems

1. Determine the set V = {x1, x2, ..., xk} of the k nearest neighbours of x.
2. Average the function values:

f̂(x) = (1/k) Σ_{i=1}^{k} f(xi)    (34)

The larger k becomes, the smoother the function f̂ is.
The distance is relevant

For large k:
- many neighbours with large distance
- few neighbours with small distance
- the calculation of f̂ is dominated by far-away neighbours

Solution: in k-NearestNeighbour the votes are weighted:

wi = 1/(1 + α d(x, xi)²)    (35)

For approximation, Equation (34) is replaced by

f̂(x) = Σ_{i=1}^{k} wi f(xi) / Σ_{i=1}^{k} wi.

- The influence of points approaches zero with increasing distance.
- All training data can then be used for classification/approximation.
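A sketch of the distance-weighted approximation (our own code; a one-dimensional feature and artificial noisy samples of a sine function serve as assumed data):

import numpy as np

def knn_approximate(X, f, x, k, alpha):
    # f_hat(x) = sum(w_i f(x_i)) / sum(w_i), w_i = 1/(1 + alpha d(x,x_i)^2)
    d = np.abs(X - x)
    idx = np.argsort(d)[:k]           # indices of the k nearest neighbours
    w = 1.0 / (1.0 + alpha * d[idx] ** 2)
    return np.sum(w * f[idx]) / np.sum(w)

rng = np.random.default_rng(1)
X = np.linspace(0, 2, 40)
f = np.sin(3 * X) + 0.1 * rng.standard_normal(40)   # noisy training data
print(round(knn_approximate(X, f, x=0.5, k=6, alpha=4.0), 2))
# smoothed estimate near sin(1.5), i.e. roughly 1.0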
k-nn with/without distance weighting

[Figure: k-nearest-neighbour approximation for k = 1, 2, 6 (top row) and nearest neighbour with distance weighting for α = 20, 4, 1 (bottom row)]

Time complexity

- learning consists of simply memorizing; therefore, learning is very fast
- but classification/approximation of a vector x is expensive:
- finding the k nearest neighbours among n training data: Θ(n)
- computing the classification/approximation: Θ(k)
- total computing time: Θ(n + k)
- nearest neighbour methods = lazy learning

Example: avalanche forecast

Task: the avalanche hazard level is to be determined from the amount of fresh snow.

[Figure: avalanche hazard level over the 3-day new snow sum in cm; a lazy-learning fit follows the data locally, an eager-learning fit is a global smooth curve]
Nearest neighbour methods: fields of application

Nearest neighbour methods are well suited for problems which need a good local approximation but which do not place high requirements on the speed of the system.

Nearest neighbour methods are not suited when a description of the knowledge extracted from the data is required which is understandable by humans (data mining).
Case-based reasoning (CBR)

- extension of nearest neighbour methods to symbolic problem descriptions and their solutions [Ric03]
- fields of application: diagnosis of technical problems; telephone hotlines, second-level support

CBR example

Feature                  Query                  Case from case base
Defective part           Rear light             Front light
Bicycle model            Marin Pine Mountain    VSF T400
Year                     1993                   2001
Power source             Battery                Dynamo
Bulb condition           ok                     ok
Light cable condition    ?                      ok
Diagnosis                ?                      Front electrical contact missing
Repair                   ?                      Establish front electrical contact

Transformation: the rear light is mapped to the front light.

CBR schema

[Figure: a query x is transformed into a known case y; the solution for y is transformed back into a solution for x]

CBR: problems

Modeling: Can the developer predict and map all possible special cases and problem variants?
Similarity: A suitable similarity metric for symbolic features is needed.
Transformation: How to find the transformation mapping and its inverse?

Solutions:
- an interesting alternative to CBR: Bayesian networks
- symbolic problem representations can often be mapped onto numeric features
- then other methods are applicable
Decision tree learning

[Figure: a decision tree for the skiing classification problem with root node Snow_Dist (≤ 100 / > 100)]

Variable    Values        Description
Ski         yes, no       Should I drive to the nearest ski resort with enough snow?
Sun         yes, no       Is there sunshine today?
Snow_Dist   ≤100, >100    Distance to the nearest ski resort with good snow conditions (over/under 100 km)
Weekend     yes, no       Is it the weekend today?

Variables for the skiing classification problem.
Training data

Day   Snow_Dist   Weekend   Sun   Skiing
 1      ≤100        yes     yes    yes
 2      ≤100        yes     yes    yes
 3      ≤100        yes     no     yes
 4      ≤100        no      yes    yes
 5      >100        yes     yes    yes
 6      >100        yes     yes    yes
 7      >100        yes     yes    no
 8      >100        yes     no     no
 9      >100        no      yes    no
10      >100        no      yes    no
11      >100        no      no     no

Rows 6 and 7 contradict each other! Thus the tree shown above is optimal.
How does the tree evolve from the data?

Optimal algorithm: create all trees, test them on the data, and finally choose the tree with the minimal error.
Disadvantage: unacceptably large computing time!
Thus: a heuristic algorithm with a greedy strategy.
Entropy as a metric for information content
- The attribute with the highest information gain shall be chosen.
- Training data set S = (yes, yes, yes, yes, yes, yes, no, no, no, no, no) with the estimated probabilities
  p_1 = p(yes) = 6/11 and p_2 = p(no) = 5/11.
- Probability distribution: p = (p_1, p_2) = (6/11, 5/11),
- in general p = (p_1, ..., p_n) with Σ_{i=1}^{n} p_i = 1.
Two extreme cases

Sure event:
  p = (1, 0, 0, ..., 0).    (36)

Uniform distribution:
  p = (1/n, 1/n, ..., 1/n).    (37)

Here the uncertainty is maximal!
Claude Shannon: how many bits would be needed minimally to encode such an event?
1. case (Equation (36)): zero bits
2. case (Equation (37)): log_2 n bits

General case:
  p = (p_1, ..., p_n) = (1/m_1, 1/m_2, ..., 1/m_n)
- log_2 m_i bits are required for the i-th case.
- Expected value H for the number of bits:

  H = Σ_{i=1}^{n} p_i log_2 m_i = Σ_{i=1}^{n} p_i log_2(1/p_i) = −Σ_{i=1}^{n} p_i log_2 p_i.    (38)
The more bits we need to encode an event, the higher the uncertainty about the outcome. Thus: entropy H as a metric for the uncertainty of a probability distribution:

  H(p) = H(p_1, ..., p_n) := −Σ_{i=1}^{n} p_i log_2 p_i.

Problem: 0 log_2 0 is undefined! Definition: 0 log_2 0 := 0 (see Exercise ??). Thus

  H(1, 0, ..., 0) = 0
and
  H(1/n, ..., 1/n) = log_2 n

is the unique global maximum of the entropy!
Special case: two classes

  H(p) = H(p_1, p_2) = H(p_1, 1−p_1) = −(p_1 log_2 p_1 + (1−p_1) log_2(1−p_1))

[Figure: the entropy function H(p, 1−p) for the case of two classes, with its maximum of 1 at p = 0.5.]
Data set D with probability distribution p: H(D) = H(p).
The information content I(D) of the data set D is the complement of its uncertainty, thus

  I(D) := 1 − H(D).    (39)
The information gain

If we now apply the entropy formula to the example, the result is H(6/11, 5/11) = 0.994.

Information gain:

  G(D, A) = Σ_{i=1}^{n} (|D_i|/|D|) I(D_i) − I(D)

With Equation (39) we obtain

  G(D, A) = Σ_{i=1}^{n} (|D_i|/|D|) I(D_i) − I(D)
          = Σ_{i=1}^{n} (|D_i|/|D|) (1 − H(D_i)) − (1 − H(D))
          = 1 − Σ_{i=1}^{n} (|D_i|/|D|) H(D_i) − 1 + H(D)
          = H(D) − Σ_{i=1}^{n} (|D_i|/|D|) H(D_i)
  G(D, Snow_Dist) = H(D) − (4/11 · H(D_{≤100}) + 7/11 · H(D_{>100}))
                  = 0.994 − (4/11 · 0 + 7/11 · 0.863) = 0.445

Analogously we obtain G(D, Weekend) = 0.150 and G(D, Sun) = 0.049.
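These numbers can be checked with a few lines of Python; a sketch, using the skiing table above (function names are illustrative):

    import numpy as np

    def entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -(p * np.log2(p)).sum()

    skiing    = np.array(['yes'] * 6 + ['no'] * 5)       # class column, days 1-11
    snow_dist = np.array(['<=100'] * 4 + ['>100'] * 7)   # attribute Snow_Dist

    H_D = entropy(skiing)    # 0.994
    gain = H_D - sum((snow_dist == v).mean() * entropy(skiing[snow_dist == v])
                     for v in np.unique(snow_dist))      # 0.445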
Thus: Snow_Dist becomes the root node.

[Figure: comparison of the three candidate root attributes, each splitting the class counts [6,5]: Snow_Dist (≤ 100 / > 100, gain 0.445), Weekend (yes/no, gain 0.150), Sun (yes/no, gain 0.049).]
Data set for the skiing classification problem (the training-data table shown above). For D_{≤100}, the classification is clearly yes; thus the tree terminates in this branch.
In the other branch, D_{>100}, there is no clear result. We compute

  G(D_{>100}, Weekend) = 0.292
  G(D_{>100}, Sun) = 0.170

Thus the next node contains the attribute Weekend. For Weekend = no, the tree terminates with the decision no. For Weekend = yes, Sun results in a gain of 0.171. The construction of the tree then terminates because no further attributes are available.
[Figure: the finished decision tree for the skiing problem over the days (1, ..., 11): root Snow_Dist; the branch ≤ 100 ends in yes, the branch > 100 continues with Weekend and, for Weekend = yes, with Sun.]
Information about attributes and classes in the file ski.names:

  |Classes:  no: do not ski,  yes: go skiing
  no,yes.
  |Attributes
  Snow_Dist: <=100, >100.
  Weekend:   no,yes.
  Sun:       no,yes.
unixprompt> c4.5 -f ski -m 1

  C4.5 [release 8] decision tree generator    Wed Aug 23
  ----------------------------------------
  Options:
      File stem <ski>
      Sensible test requires 2 branches with >=1 cases
  Read 11 cases (3 attributes) from ski.data

  Decision Tree:
  Snow_Dist = <=100: yes (4.0)
  Snow_Dist = >100:
  |   Weekend = no: no (3.0)
  |   Weekend = yes:
  |   |   Sun = no: no (1.0)
  |   |   Sun = yes: yes (3.0/1.0)

  Simplified Decision Tree:
  Snow_Dist = <=100: yes
  Snow_Dist = >100: no (7.0/3.4)

  Evaluation on training data (11 items):

       Before Pruning            After Pruning
       ----------------    ---------------------------
       Size      Errors    Size      Errors   Estimate
          7    1 ( 9.1%)      3    2 (18.2%)   (41.7%)
The same procedure applied to the appendicitis data:

unixprompt> c4.5 -f app -u -m 100

  C4.5 [release 8] decision tree generator    Wed Aug 23 13:13:15 2006
  ----------------------------------------
  Read 9764 cases (15 attributes) from app.data

  Decision Tree:
  [C4.5 tree on the attributes Leukocytes and Temp_rectal with thresholds such as 11030, 8600, 381 and 378; the comparison operators were lost in extraction.]

The link-analysis algorithm (Octave; only the tail of the function survived extraction):

      ...
      while delta > 10^(-3)
        CRold = CR;
        PR = A' * CR;
        CR = B * PR;
        colsums = sum(CR);
        for j = 1:m
          CR(:,j) = CR(:,j)/colsums(j);   % normalize each column of CR
        endfor
        CR = CR + eta * eye(m);
        delta = norm(CR - CRold);
      endwhile
    endfunction
The link-analysis algorithm: result

Interaction matrix A:

        c1  c2  c3  c4  c5  c6  c7  c8  c9  c10
  p1     1   1   1   1   1   0   0   0   0   1
  p2     0   1   0   0   1   0   1   1   1   1
  p3     1   1   1   1   1   0   1   1   0   0
  p4     0   0   1   1   0   0   0   0   0   1
  p5     0   0   0   0   0   1   0   0   1   0
  p6     0   1   0   1   1   0   0   0   1   0

Product representativeness PR^T:

        c1    c2    c3    c4    c5    c6    c7    c8    c9    c10
  p1   1.73  1.66  1.76  1.75  1.66  0.48  0.59  0.59  0.49  1.71
  p2   0.59  1.67  0.56  0.58  1.67  0.56  1.72  1.72  1.70  1.66
  p3   1.88  1.82  1.86  1.84  1.82  0.60  1.84  1.84  0.64  0.78
  p4   0.33  0.28  1.39  1.37  0.28  0.22  0.25  0.25  0.21  1.37
  p5   0.04  0.09  0.03  0.06  0.09  1.33  0.08  0.08  1.27  0.07
  p6   0.37  1.43  0.37  1.43  1.43  0.38  0.37  0.37  1.46  0.39
Comparison: user-based vs. link-analysis

User-based collaborative filtering scores W_C · A:

        c1    c2    c3    c4    c5    c6    c7    c8    c9    c10
  p1   4.60  4.38  4.37  4.35  4.38  1.45  3.50  3.50  1.63  3.67
  p2   3.40  4.50  2.67  2.65  4.50  2.05  4.50  4.50  3.33  3.33
  p3   5.35  5.38  4.71  4.60  5.38  1.85  5.00  5.00  2.29  3.67
  p4   2.10  1.62  2.52  2.35  1.62  0.74  1.38  1.38  0.48  2.17
  p5   0.49  0.68  0.28  0.33  0.68  1.72  0.84  0.84  1.72  0.61
  p6   2.40  3.12  1.85  2.40  3.12  1.27  2.38  2.38  2.15  1.83

Product representativeness PR^T: as in the table above.
Content-based filtering
- Products are recommended based on their own properties rather than on other users' ratings.
- Product description via a set of features x_1, ..., x_k.
- Goal: find a mapping from product features to a score.
- Sort the products with respect to their score.
Example: the system yakadoo

[Screenshots: the recommender system yakadoo asks the customer a sequence of questions.]

- Expert system for customer service.
- Replaces the salesperson.
- Goal: recommend a product that optimally matches the customer's requirements.
- Solution 1: nearest neighbour method.
Nearest neighbour filtering
- Target feature vector t = (t_1, ..., t_k) from the customer interview.
- Find the most similar product in the database containing the items(31) x_1, ..., x_n:

    recommended product = argmin_{i=1,...,n} d(t, x_i)

- Euclidean distance:

    d(t, x) = Σ_{i=1}^{k} (t_i − x_i)².

(31) More precisely: x_1, ..., x_n are the feature vectors of the items.
Weighted nearest neighbour filtering
- Weighted Euclidean distance:

    d(t, x) = Σ_{i=1}^{k} w_i (t_i − x_i)².

- How to find the weight vector (w_1, ..., w_k)?
  - manually, or
  - by minimizing the number of misclassifications with respect to the weight vector (a sketch follows below).
Part IX

Neural Networks
From Biology to Simulation
Hopfield Networks
Neural Associative Memory
Linear Networks with Minimal Error
The Backpropagation Algorithm
Support Vector Machines
Neural networks
- The human brain has between 10 and 100 billion nerve cells.
- Around the year 1900, biologists started to understand neurons.
- 1943, McCulloch and Pitts: "A logical calculus of the ideas immanent in nervous activity" (in [AR88]).
- A bionics branch within AI.
- Every neuron is connected to 1000-10,000 other neurons; approximately 10^14 connections in total.
- A neural impulse lasts approximately one millisecond; thus the firing rate of a neuron is below one kilohertz.
Biology and simulation

[Figure: biological neuron with cell body, axon, dendrites and synapses, next to its simulated counterpart.]
The mathematical model

Charging of the activation potential by summation of the weighted output values of the input connections x_1, ..., x_n:

  Σ_{j=1}^{n} w_ij x_j.

Neuron i computes its output as

  x_i = f( Σ_{j=1}^{n} w_ij x_j )

with activation function f.

Threshold function (Heaviside step function):

  f(x) = H_Θ(x) = { 0 if x < Θ,  1 else }.

The neuron computes its output by

  x_i = { 0 if Σ_{j=1}^{n} w_ij x_j < Θ,  1 else }.

This equation is identical to a perceptron with threshold Θ.

[Figure: neuron x_i with weights w_1, ..., w_n on the inputs x_1, ..., x_n.]
The step function is discontinuous. Alternative: the sigmoid function

  f(x) = 1 / (1 + e^{−(x−Θ)/T}).

[Figure: sigmoid functions for T = 0.01, 0.4, 1, 2, 5, 100; for small T the sigmoid approaches the step function at Θ.]
Modeling learning

D. Hebb, 1949 (Hebb rule): If there is a connection w_ij between neuron j and neuron i, and repeated signals from neuron j to neuron i lead to a parallel activation of both neurons, then the weight w_ij is amplified. A possible formula for the weight change Δw_ij is

  Δw_ij = η x_i x_j

with a constant η (learning rate) that determines the size of a learning step.
Hopfield networks
- With the Hebb rule, the weights can only grow; weakening is not possible.
- Weakening can be modeled with a constant decay parameter that shrinks unused weights at each time step (e.g. by the factor 0.99).

Another solution (Hopfield 1982):
- binary neurons with values ±1
- Δw_ij = η x_i x_j
- When does Δw_ij become positive / negative?
Autoassociative memory
- In an autoassociative memory, patterns can be memorized.
- Recall: input of similar patterns.
- Classical application: recognition of characters.

Learning phase: N binary coded patterns q^1, ..., q^N shall be learned, with q_j^k ∈ {−1, 1} one pixel per pattern.
- n pixels: a neural network consisting of n neurons
- the neurons are completely connected
- the weight matrix is symmetric
- diagonal elements w_ii = 0
[Figure: recurrent connection of two neurons x_i and x_j via the weights w_ij and w_ji.]

Learning:

  w_ij = (1/n) Σ_{k=1}^{N} q_i^k q_j^k.    (43)

Note the relation to the Hebb rule!
Pattern recognition

Input a new pattern x and update all neurons by

  x_i = { −1 if Σ_{j≠i} w_ij x_j < 0,  1 else }    (44)

until the network becomes stable.

Programming scheme:

  HopfieldAssociator(q)
    Initialize all neurons: x = q
    Repeat
      i = Random(1, n)
      Update neuron i according to Equation (44)
    Until x converges
    Return(x)
Application to a pattern-recognition example
- digits in a 10 × 10 pixel field
- the Hopfield network consists of 100 neurons,
- in total 100 · 99 / 2 = 4950 weights

[Figure: the four patterns learned by the network.]
[Figure: recall of the learned digit patterns from inputs with 10% noise; the network converges after 62 to 379 steps. Also shown: a few stable states of the network that were not learned.]

[Figure: the ten learned patterns, six noisy patterns with 10% noise, and the stable states reached from the noisy patterns.]
Analysis

neurons ↔ elementary magnets

[Figure: neural and physical interpretation of the Hopfield model: two neurons x_i, x_j with symmetric coupling w_ij = w_ji, analogous to two coupled spins ±1.]

Total energy of the system:

  E = −(1/2) Σ_{i,j} w_ij x_i x_j.

The same holds in physics: w_ii = 0 and w_ij = w_ji.
Hopfield dynamics
- A physical system in equilibrium minimizes its energy E.
- The system moves to a state of minimum energy.
- In every step, the Hopfield dynamics updates the state of one neuron towards minimum total energy.

Contribution of neuron i to the total energy:

  −(1/2) x_i Σ_{j≠i} w_ij x_j.

If Σ_{j≠i} w_ij x_j < 0, then x_i = −1 gives a negative contribution to the total energy, whereas x_i = 1 gives a positive one.
Analogously,

  Σ_{j≠i} w_ij x_j ≥ 0

requires x_i = 1.

- The total energy of the system decreases monotonically over time.
- The network moves to a state of minimum energy.
- What is the significance of the minima of the energy function?
- The learned patterns are minima of the energy function.
- If too many patterns are learned, the system converges to minima that are not the learned patterns.
- Change from order to chaos: a phase transition.
Physics:
- melting of an ice crystal
- inside the crystal: a state of high order
- in liquid water, the ordered structure of the molecules is lost

Neural network:
- phase transition
- ordered: learning and recognition of patterns
- chaotic: learning with too many patterns
- effects that we can observe in ourselves
Phase transition in Hopfield networks

Assume the neurons are in the pattern state q^1. Into Σ_{j≠i} w_ij q_j^1 we insert the learned weights from Equation (43):

  Σ_{j≠i} w_ij q_j^1 = (1/n) Σ_{j≠i} Σ_{k=1}^{N} q_i^k q_j^k q_j^1
                     = (1/n) Σ_{j≠i} ( q_i^1 (q_j^1)² + Σ_{k=2}^{N} q_i^k q_j^k q_j^1 )
                     ≈ q_i^1 + (1/n) Σ_{j≠i} Σ_{k=2}^{N} q_i^k q_j^k q_j^1,

i.e. the i-th component of the input pattern plus a sum of (n−1)(N−1) terms. If these terms are statistically independent, the sum can be described by a normally distributed random variable with zero mean and standard deviation

  (1/n) √((n−1)(N−1)) ≈ √((N−1)/n).
- The noise does not disturb recall as long as N ≪ n!
- A detailed computation shows that the critical point is N = 0.146 n.
- Example: with 100 neurons, up to 14 uncorrelated patterns can be memorized.
- The memory capacity is below that of a simple list memory!
- Hopfield networks only work well if the patterns have about 50% 1-bits.
- Workaround: neurons with thresholds.
Summary and outlook
- The Hopfield model triggered a wave of enthusiasm for neural networks in the 1980s.
- Hopfield networks are recurrent.
- Networks without recurrent connections are easier to understand.
- Problem: local minima ⇒ Boltzmann machine, simulated annealing.
- The Hopfield dynamics is also applicable to other energy functions,
- e.g. for the travelling salesman problem.
Neural associative memory
- Phone book: a mapping from names to phone numbers, realized as tables in a database.
- Entrance control to a building using a photograph of the person.
- Problems arise when only photographs are memorized in a database.
- Solution: an associative memory can relate similar photographs to the right name.
- A typical task for machine learning methods.
- Straightforward approach: the nearest neighbour method. Not applicable for entrance control. Why?
Correlation matrix memory

Given: training data with N request-response pairs

  T = {(q^1, t^1), ..., (q^N, t^N)}

Wanted: a matrix W that maps all request vectors correctly to their responses, i.e. for p = 1, ..., N

  t^p = W · q^p   or, respectively,   t_i^p = Σ_{j=1}^{n} w_ij q_j^p.    (45)

[Figure: two-layer network with weights w_11, ..., w_mn mapping the inputs q_1, ..., q_n to the outputs t_1, ..., t_m.]
Computation of the matrix elements w_ij (Hebb rule):

  w_ij = Σ_{p=1}^{N} q_j^p t_i^p    (46)

Definition: Two vectors x and y are called orthonormal if

  x · y = { 1 if x = y,  0 else }.
Theorem: If all request vectors q^p of the training data are orthonormal, then every vector q^p multiplied with the matrix W is mapped to the response vector t^p.

Proof: We insert Equation (46) into Equation (45) and derive

  (W · q^p)_i = Σ_{j=1}^{n} w_ij q_j^p
             = Σ_{j=1}^{n} Σ_{r=1}^{N} q_j^r t_i^r q_j^p
             = t_i^p Σ_{j=1}^{n} q_j^p q_j^p + Σ_{r=1, r≠p}^{N} t_i^r Σ_{j=1}^{n} q_j^r q_j^p
             = t_i^p,

since Σ_j q_j^p q_j^p = 1 and Σ_j q_j^r q_j^p = 0 for r ≠ p.

Problem: suppose the name Hans is the correct output for a certain face. Possible outputs when inputting similar faces: e.g. Gans or Hbns.
The pseudoinverse

Request vectors as columns of an n × N matrix Q = (q^1, ..., q^N), response vectors as columns of an m × N matrix T = (t^1, ..., t^N):

  T = W · Q    (47)

Solving the equation for W:

  W = T · Q^{−1}.    (48)

How to invert a non-invertible matrix? A matrix Q is invertible if there is a matrix Q^{−1} with the property

  Q · Q^{−1} = I.    (49)

Wanted: a matrix that comes as close as possible to this property.

Definition: Let Q be a real n × m matrix. An m × n matrix Q^+ is called a pseudoinverse of Q if it minimizes ||Q · Q^+ − I||. Here ||M|| is the quadratic norm, i.e. the sum over the squares of all matrix elements of M.

So:

  W = T · Q^+    (50)

The weight matrix W minimizes the associative error (crosstalk) ||T − W · Q||.

- The search for the pseudoinverse is not easy.
- Alternative: the backpropagation algorithm.
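Numerically, however, the pseudoinverse is readily available, e.g. via the SVD in NumPy. A sketch of Equation (50); the toy matrices are illustrative:

    import numpy as np

    # columns of Q are the request vectors, columns of T the responses (toy data)
    Q = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # n x N
    T = np.array([[1.0, 0.0], [0.0, 1.0]])               # m x N

    W = T @ np.linalg.pinv(Q)   # Equation (50): minimizes the crosstalk ||T - W Q||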
The binary Hebb rule (Palm model)

Patterns are binary encoded: q^p ∈ {0, 1}^n and t^p ∈ {0, 1}^m.

[Figure: binary weight matrix for n = 10, m = 6, built from three pattern pairs (q^1, t^1), (q^2, t^2), (q^3, t^3).]

Recall of the memorized patterns:

[Figure: computation of the products W q^1, W q^2, W q^3; the resulting column sums are compared against a threshold.]

Binary Hebb rule:

  w_ij = ⋁_{p=1}^{N} q_j^p t_i^p.    (51)
Memory capacity
- The weight matrix must be sparse.
- The matrix has m·n elements; a pair to be memorized has m + n bits. For the number N_max of memorizable patterns:(32)

  α = (number of memorizable bits) / (number of memory cells) = (m + n) N_max / (m n) ≤ ln 2 ≈ 0.69.    (52)

  Memory model                               α
  List memory                                1
  Associative memory with binary Hebb rule   0.69
  Kohonen associative memory                 0.72
  Hopfield networks                          0.292

(32) G. Palm. "On Associative Memory". In: Biological Cybernetics 36 (1980), pp. 19-31.
An error-correction program

Coding of the request vectors q: the request vector consists of 676 bits, one for every ordered pair of letters aa, ab, ..., az, ba, ..., bz, ..., za, ..., zz (26 · 26 = 676 pairs). Example: for hans, the combinations ha, an and ns are set to 1.

In the response vector t, 26 bits are reserved for every position in the word (word length at most 10), i.e. 260 bits. For hans, the bits 8, 27, 66 and 97 are set.

The weight matrix W has the size 676 · 260 bit = 175760 bit. Memory capacity:

  N_max ≤ 0.69 · (m n)/(m + n) = 0.69 · (676 · 260)/(676 + 260) ≈ 130 words
Memorized names: agathe, agnes, alexander, andreas, andree, anna, annemarie, astrid, august, bernhard, bjorn, cathrin, christian, christoph, corinna, corrado, dieter, elisabeth, elvira, erdmut, ernst, evelyn, fabrizio, frank, franz, geoffrey, georg, gerhard, hannelore, harry, herbert, ingilt, irmgard, jan, johannes, johnny, juergen, karin, klaus, ludwig, luise, manfred, maria, mark, markus, marleen, martin, matthias, norbert, otto, patricia, peter, phillip, quit, reinhold, renate, robert, robin, sabine, sebastian, stefan, stephan, sylvie, ulrich, ulrike, ute, uwe, werner, wolfgang, xavier
Associations of the program:

  Insert pattern: harry
  Threshold: 4, Response: harry
  Threshold: 3, Response: harry
  Threshold: 2, Response: horryrrde
  -------------------------------
  Insert pattern: andrees
  Threshold: 6, Response: a
  Threshold: 5, Response: andree
  Threshold: 4, Response: andrees
  Threshold: 3, Response: mnnrens
  Threshold: 2, Response: morxsnssr
  -------------------------------
  Insert pattern: ute
  ...
Linear networks with minimal error

Idea: learning from errors.

Given: training vectors

  T = {(q^1, t^1), ..., (q^N, t^N)} with q^p ∈ [0,1]^n, t^p ∈ [0,1]^m.

Wanted: a function f : [0,1]^n → [0,1]^m that minimizes the quadratic error

  Σ_{p=1}^{N} (f(q^p) − t^p)²

on the data.

Possible solution:

  f(q) = 0 if q ∉ {q^1, ..., q^N},  and  f(q^p) = t^p for all p ∈ {1, ..., N}.

Why are we not happy with this function?
- Because we want to build an intelligent system!
- It should generalize from the training data to new, unknown data.
- We do not want overfitting.
- We want a smooth function that balances between the points.
- But then we have to further confine the function class.
The method of least squares

Linear two-layered network:

[Figure: a single output neuron y with weights w_1, ..., w_n from the inputs x_1, ..., x_n.]

  y = f( Σ_{i=1}^{n} w_i x_i ),

computed with f(x) = x. A sigmoid activation function would be irrelevant here, because it is strictly monotonically increasing!
Wanted: a vector w that minimizes the error

  E(w) = Σ_{p=1}^{N} (w q^p − t^p)² = Σ_{p=1}^{N} ( Σ_{i=1}^{n} w_i q_i^p − t^p )².

Necessary condition for a minimum of the error function: for j = 1, ..., n we require

  ∂E/∂w_j = 2 Σ_{p=1}^{N} ( Σ_{i=1}^{n} w_i q_i^p − t^p ) q_j^p = 0.

Expanding gives

  Σ_{p=1}^{N} ( Σ_{i=1}^{n} w_i q_i^p q_j^p − t^p q_j^p ) = 0.
Exchanging the sums:

  Σ_{i=1}^{n} w_i Σ_{p=1}^{N} q_i^p q_j^p = Σ_{p=1}^{N} t^p q_j^p.

With

  A_ij = Σ_{p=1}^{N} q_i^p q_j^p   and   b_j = Σ_{p=1}^{N} t^p q_j^p   for i, j = 1, ..., n    (53)

we get the matrix equation (normal equations):

  A w = b.    (54)

- The normal equations have at least one solution, and exactly one if A is invertible.
- A is positive definite, and thus the solution w is a global minimum.

Computing time:
- setting up the matrix: Θ(N · n²)
- solving the system of equations: O(n³)

This method can easily be extended to multiple output neurons (a sketch follows below).
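A sketch of the normal-equation solution in NumPy, with one training vector q^p per row of Q (the function name is illustrative):

    import numpy as np

    def least_squares_weights(Q, t):
        # normal equations (53), (54): A w = b with A = Q^T Q, b = Q^T t
        A = Q.T @ Q
        b = Q.T @ t
        return np.linalg.solve(A, b)   # use np.linalg.lstsq if A is singular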
Application to appendicitis

Linear mapping of the symptoms to the continuous class variable AppScore (the coefficient-variable pairing was partly lost in extraction):

  AppScore = 0.00085·Alter − 0.125·Geschl − 0.025·S4Q + 0.12·Ersch + 0.025·Diab
           + 0.035·AbwGlo + 0.0034·Leuko + 0.031·RektS + 0.0027·TAxi
           + 0.0031·TRek − 0.021·Losl + ... − 1.83

(the remaining terms involve S1Q, S2Q, S3Q and AbwLok with the coefficients 0.081, 0.000021, −0.11 and 0.13).

- AppScore takes continuous values, although the class App is binary!
- Threshold decision!
Error rate

[Figure: error rate over the threshold Θ for least squares on 6640 training data, least squares on 3031 test data, and RProp on 3031 test data. Error from least squares on training and test data.]
The delta rule
- So far: batch learning.
- Now: incremental learning. In every step, the weights are adjusted according to the new training sample: w_j = w_j + Δw_j.
- Error minimization, thus again the partial derivative

  ∂E/∂w_j = 2 Σ_{p=1}^{N} ( Σ_{i=1}^{n} w_i q_i^p − t^p ) q_j^p.

- Gradient:

  ∇E = ( ∂E/∂w_1, ..., ∂E/∂w_n )

- Gradient descent:

  Δw_j = −η ∂E/∂w_j = −η Σ_{p=1}^{N} ( Σ_{i=1}^{n} w_i q_i^p − t^p ) q_j^p.

Replacing the activation

  y^p = Σ_{i=1}^{n} w_i q_i^p

of the output neuron for input pattern q^p yields the delta rule:

  Δw_j = η Σ_{p=1}^{N} (t^p − y^p) q_j^p.

  DeltaLearning(TrainingExamples, η)
    Initialize all weights w_j arbitrarily
    Repeat
      Δw = 0
      For all (q^p, t^p) ∈ TrainingExamples
        Compute the network output y^p = w · q^p
        Δw = Δw + η (t^p − y^p) q^p
      w = w + Δw
    Until w converges

The learning rate η determines the size of the weight changes.
The incremental delta rule
- So far, the weights are changed only after all training examples have been applied.
- Alternative: change the weights directly after each training example (a sketch follows below).

  DeltaLearningIncremental(TrainingExamples, η)
    Initialize all weights w_j arbitrarily
    Repeat
      For all (q^p, t^p) ∈ TrainingExamples
        Compute the network output y^p = w · q^p
        w = w + η (t^p − y^p) q^p
    Until w converges
Comparison to the perceptron
- With the perceptron, a classifier for linearly separable classes is learned by a threshold decision.
- The method of least squares and the delta rule generate a linear approximation of the data.
- From the learned linear approximation, a classifier can be generated by applying a threshold function.
- If incremental learning is not required, the method of least squares should be preferred (among other reasons because of its computing time).
The backpropagation algorithm
- Extension of the incremental delta rule
- with a sigmoid activation function
- and more than two layers of neurons.
- Known from the article(33) in the legendary PDP collection.(34)

(33) D.E. Rumelhart, G.E. Hinton and R.J. Williams. "Learning Internal Representations by Error Propagation". In [RM86], 1986.
(34) D. Rumelhart and J. McClelland. Parallel Distributed Processing. Vol. 1. MIT Press, 1986.
[Figure: feed-forward network with input layer, hidden layer (weights w^1_11, ..., w^1_{n1 n2}) and output layer (weights w^2_11, ..., w^2_{n2 n3}); the output x_j^p is compared with the target output t_j^p.]

Neuron model:

  x_j = f( Σ_{i=1}^{n} w_ji x_i )    (55)

with the sigmoid function

  f(x) = 1 / (1 + e^{−x}).
Similar to the incremental delta rule, the weights are changed for training pattern p according to the negative gradient of the summed quadratic error function over all output neurons

  E_p(w) = (1/2) Σ_{k ∈ output} (t_k^p − x_k^p)²,

that is

  Δ_p w_ji = −η ∂E_p/∂w_ji.

Repeated application of the chain rule (see(35) or(36)) provides the backpropagation learning rule (generalized delta rule)

  Δ_p w_ji = η δ_j^p x_i^p,

with

  δ_j^p = { x_j^p (1 − x_j^p)(t_j^p − x_j^p)        if j is an output neuron,
            x_j^p (1 − x_j^p) Σ_k δ_k^p w_kj        if j is a hidden neuron. }

[Figure: naming of neurons and weights (w_ji connects neuron i to neuron j) for the application of the backpropagation learning rule.]

(35) D.E. Rumelhart, G.E. Hinton and R.J. Williams. "Learning Internal Representations by Error Propagation". In [RM86], 1986.
(36) A. Zell. Simulation Neuronaler Netze. Addison Wesley, 1994. (The simulator described in the book.)
  BackPropagation(TrainingExamples, η)
    Initialize all weights arbitrarily
    Repeat
      For all (q^p, t^p) ∈ TrainingExamples
        1. Feed the request vector q^p to the input layer
        2. Forward propagation:
           For all layers, upwards, starting from the first hidden layer
             For all neurons of the layer
               Compute the activation x_j = f( Σ_{i=1}^{n} w_ji x_i )
        3. Compute the squared error E_p(w)
        4. Backward propagation:
           For all layers, downwards, starting from the last
             For all weights w_ji
               w_ji = w_ji + η δ_j^p x_i^p
    Until w converges or time limit reached

A runnable sketch follows below.
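A minimal NumPy implementation of the scheme for one hidden layer, trained here on XOR. The architecture (3 hidden neurons), learning rate, epoch count and the appended bias inputs (cf. the remarks on bias neurons below) are illustrative assumptions; with these settings the outputs typically converge to values near 0 and 1:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    rng = np.random.default_rng(0)
    # XOR data; a constant 1 is appended to each input as bias
    Q = np.array([[0,0,1],[0,1,1],[1,0,1],[1,1,1]], dtype=float)
    T = np.array([0.0, 1.0, 1.0, 0.0])

    W1 = rng.normal(size=(3, 3))   # input(+bias) -> 3 hidden neurons
    W2 = rng.normal(size=4)        # hidden(+bias) -> output neuron
    eta = 0.5

    for _ in range(20000):
        for q, t in zip(Q, T):
            h = sigmoid(q @ W1)                       # forward propagation, hidden layer
            hb = np.append(h, 1.0)                    # bias neuron in the hidden layer
            y = sigmoid(hb @ W2)                      # output
            d_out = y * (1 - y) * (t - y)             # delta of the output neuron
            d_hid = h * (1 - h) * d_out * W2[:3]      # deltas of the hidden neurons
            W2 += eta * d_out * hb                    # generalized delta rule
            W1 += eta * np.outer(q, d_hid)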
Remarks
- Non-linear mappings can be learned.
- Without a hidden layer, only linear mappings can be learned.
- Multi-layered networks with only linear activation functions can also learn only linear mappings.
- A variable sigmoid function

  f(x) = 1 / (1 + e^{−(x−Θ)})

  is used with threshold Θ.
- As with the perceptron, a bias neuron with constant activation 1 is added to each layer, with connections to all neurons of the next layer. The weights of these connections are trained normally; they represent the thresholds Θ of the successor neurons.
NETtalk: a network learns to speak

Sejnowski and Rosenberg, 1986(37): a system that learns to read aloud intelligibly.

(37) T.J. Sejnowski and C.R. Rosenberg. "NETtalk: a parallel network that learns to read aloud". Tech. Rep. JHU/EECS-86/01, The Johns Hopkins University, Electrical Engineering and Computer Science, 1986. Reprinted in [AR88], pp. 661-672.
[Figure: NETtalk architecture: an input window of 7 × 29 = 203 neurons sliding over the text (e.g. "the_father_is"), 80 hidden neurons, and 26 output neurons encoding phonemes and stress (labels such as accented, central, deep).]

Demo: NETtalk (http://www.cnl.salk.edu/ParallelNetsPronounce/index.php)
- The network has been trained with 1000 words.
- For every character, the stress has been specified as output.
- For converting the stress attributes to real sounds, a part of the speech synthesizer DECtalk has been used.
- The network consists of 203 · 80 + 80 · 26 = 18320 weights.
- On a VAX 780, 50 cycles over all words have been trained.
- With 5 characters per word: 5 · 50 · 1000 = 250 000 iterations of the backpropagation algorithm,
- 69 hours of computing time.
- Human-like properties: in the beginning only simple, unclear words; later a correctness of 95%.
Learning heuristics for theorem provers
- The semantic web requires automatic theorem provers for responding to search requests.
- For proving: a great number of possible inference steps.
- Combinatorial explosion of the search space.
- Heuristic search, e.g. the A*-algorithm.
- But where does the heuristic evaluation function come from?

Search tree of a prover:

[Figure: search tree with nodes marked as positive training data, negative training data, or unused.]

The training:
- evaluate multiple alternatives for the next step,
- select the alternative with the best score.
- Resolution: evaluation of the available clauses based on attributes such as the number of literals, the number of positive literals, the complexity of a term, etc.
- Positive and negative training data.
- On this data, a backpropagation network is trained.
- With this learned network, the prover is much faster (see(38),(39)).

(38) W. Ertel, J. Schumann and Ch. Suttner. "Learning Heuristics for a Theorem Prover using Back Propagation". In: 5. Österreichische Artificial-Intelligence-Tagung. Ed. by J. Retti and K. Leidlmair. Berlin, Heidelberg: Informatik-Fachberichte 208, Springer-Verlag, 1989, pp. 87-95.
(39) Ch. Suttner and W. Ertel. "Automatic Acquisition of Search Guiding Heuristics". In: 10th Int. Conf. on Automated Deduction. LNAI 449, Springer-Verlag, 1990, pp. 470-484.
Problems and improvements
- local minima of the error function
- convergence of backpropagation is often slow
- momentum term in the learning rule:

  Δ_p w_ji(t) = η δ_j^p x_i^p + γ Δ_p w_ji(t−1)

- minimizing a linear error function instead of the quadratic one (L1 norm)
- quadratic approximation of the error function (Fahlman): Quickprop
- adaptive step-size control (Riedmiller): RProp
Support vector machines

Advantages of linear neural networks:
- fast learning
- convergence guarantee
- low danger of overfitting

Advantage of non-linear networks (e.g. backpropagation):
- complex functions can be learned

Disadvantages of non-linear networks (e.g. backpropagation):
- local minima, convergence problems, overfitting

Solution: support vector machines (SVMs)
1. A non-linear transformation of the data such that the transformed data are linearly separable. This transformation is called a kernel.
2. In the transformed space, the support vectors are determined.

[Figure: two classes (+/−) in the plane, separated by a curved boundary.]

Linear separation of the classes: if the data are consistent, it is always possible to make them linearly separable by transforming the vector space of the classes,(40) e.g. by introducing a new dimension x_{n+1}:

  x_{n+1} = { 1 if x ∈ class 1,  0 if x ∈ class 0 }.

(40) A data point is inconsistent if it belongs to both classes.
[Figure: the same two classes in the plane (x_1, x_2), and after the transformation in three dimensions (x_1, x_2, x_3), where they become linearly separable.]

It's still not that easy. Why?
- SVMs are also applicable to regression problems.
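With a library implementation, the kernel trick is a one-liner. A sketch using scikit-learn (assuming it is installed); the XOR-like toy data and the parameter values are illustrative:

    import numpy as np
    from sklearn.svm import SVC

    # four points that are not linearly separable in the plane
    X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
    y = np.array([0, 0, 1, 1])

    clf = SVC(kernel='rbf', C=1.0, gamma=1.0)   # RBF kernel as the non-linear transformation
    clf.fit(X, y)
    print(clf.support_vectors_)                 # the support vectors found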
Literature:
- B. Schölkopf and A. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.
- E. Alpaydin. Introduction to Machine Learning. MIT Press, 2004.
- C.J. Burges. "A Tutorial on Support Vector Machines for Pattern Recognition". In: Data Mining and Knowledge Discovery 2.2 (1998), pp. 121-167. doi: 10.1023/A:1009715923555.
Applications of neural networks
- in all areas of industry
- all kinds of pattern recognition
- analysis of photographs for recognizing people or faces
- recognition of fish swarms in sonar echoes
- recognition and classification of military aircraft in radar scans
- recognition of speech and handwritten text
- robot control
- heuristic search control in backgammon or chess computers
- applications in reinforcement learning (Chapter 10)
- forecasting of stock prices
- evaluation of the credit-worthiness of bank customers
- ...
Summary and outlook
- perceptron, delta rule and backpropagation
- related to scores, naive Bayes and the method of least squares
- Hopfield networks are very instructive, but hard to handle in practice
- important in practice: associative memory models (Kohonen, Palm)

The following applies to all NN models:
- Information is stored distributed over many weights (holographically).
- The death of single neurons has almost no impact on the functioning of the brain.
- The networks are robust against small disturbances.
- Corrupted patterns can still be recognized.

Disadvantages:
- It is hard to localize information.
- In practice it is almost impossible to analyze or understand the weights trained in a network (⇒ decision trees).
- The "grandmother neuron" is hard to find.
- Combining NNs with symbol-processing systems is problematic.

Not examined:
- self-organizing maps (Kohonen)
- incremental learning
- sequential networks for learning time-dependent processes
- learning without a teacher (see Chapter 10)

Literature:
- A. Zell. Simulation Neuronaler Netze. Addison Wesley, 1994. The simulator SNNS (resp. JNNS) described in the book: www-ra.informatik.uni-tuebingen.de/SNNS.
- H. Ritter, T. Martinez and K. Schulten. Neuronale Netze. Addison Wesley, 1991.
- R. Rojas. Theorie der neuronalen Netze. Springer, 1993.

Collections of important original work:
- J. Anderson and E. Rosenfeld. Neurocomputing: Foundations of Research. Cambridge, MA: MIT Press, 1988.
- J. Anderson, A. Pellionisz and E. Rosenfeld. Neurocomputing (vol. 2): Directions for Research. Cambridge, MA: MIT Press, 1990.

Journals: Neural Networks; Neurocomputing.
Conference: NIPS (Neural Information Processing Systems).
Part X

Reinforcement Learning
Introduction
- Robotics: tasks are often complex
- and not directly programmable.
- Task: learning by trial and error which actions are good.
- In this way, humans learn, for example, how to walk:
- reward for forward movement,
- punishment for falling.
Negative reinforcement: Next time swing earlier? Ski slower? Learning by punishment.

Positive reinforcement: learning by reward.
The crawler

[Figure: the crawler robot: an arm with joint angles g_x, g_y attached to a body that moves along the x-axis (positions −5 to 5); a sequence of arm movements pushes the body forward.]
The walking robot

[Figure: the walking robot moving along the x-axis (positions −1 to 3), shown as a sequence of snapshots.]
The state space

[Figure: discretized state spaces of the walking robot with the actions left (←), right (→) and up/down (↑↓): 2 × 2 states (left), 4 × 4 states (middle), and an optimal strategy (right).]

State space: 2 × 2 (left), 4 × 4 (middle), optimal strategy (right).
The agent

[Figure: the agent observes the state s and the reward r from the environment and responds with an action a.]
The learning task

State s_t ∈ S; action a_t: s_t → s_{t+1}.

Transition function δ: s_{t+1} = δ(s_t, a_t).

Immediate reward r_t = r(s_t, a_t):
- r_t > 0: positive reward
- r_t = 0: no reward (often for a long time!)
- r_t < 0: negative reward (punishment)

Policy: π : S → A.

A policy is optimal if it maximizes the long-term reward.
A policy is optimal if it maximizes the long-term reward
Discounted reward Value function: π
2
V (st ) = rt + γrt+1 + γ rt+2 + . . . =
∞ X
γ i rt+i .
(56)
i=0
Alternative:
h
1X rt+i . V (st ) = lim h→∞ h i=0 π
A policy π ? is
optimal,
in case for all states s ?
V π (s) ≥ V π (s) Acronym: V ? = V π
?
(57)
(58)
Decision processes
- Markov decision process (MDP): the reward of an action depends only on the current state and the current action.
- POMDP (partially observable Markov decision process): the agent's state is not fully known.
Uninformed combinatorial search

[Figure: number of possible actions per state in small grid state spaces.]

  Grid        2 × 2      3 × 3            4 × 4             5 × 5
  # states    4          9                16                25
  # policies  2^4 = 16   2^4 · 3^4 · 4    2^4 · 3^8 · 4^4   2^4 · 3^12 · 4^9
                         = 5184           ≈ 2.7 · 10^7      ≈ 2.2 · 10^12

In general, an n_x × n_y grid has:
- 4 corner nodes with 2 possible actions,
- 2(n_x − 2) + 2(n_y − 2) side nodes with 3 possible actions,
- (n_x − 2)(n_y − 2) inner nodes with 4 possible actions,

thus 2^4 · 3^{2(n_x−2)+2(n_y−2)} · 4^{(n_x−2)(n_y−2)} possible policies.
Value of states

[Figure: two policies π1 and π2 for the walking robot, drawn as arrows in the state space, both starting at s0.]

Movement to the right is rewarded by 1, movement to the left is punished by −1.
Average speed: π1: 3/8 = 0.375, π2: 2/6 ≈ 0.333.

With V^π(s_t) = r_t + γ r_{t+1} + γ² r_{t+2} + ...:

  γ             0.9    0.8375   0.8
  V^{π1}(s0)    2.81   1.38     0.96
  V^{π2}(s0)    2.66   1.38     1.00

The greater γ, the greater the time horizon for the evaluation of policies!
Value iteration and dynamic programming

Dynamic programming, Richard Bellman, 1957 [Bel57]: Independently of the start state s_t and the first action a_t, all decisions in the possible successor states s_{t+1} must be optimal. This yields a globally optimal policy by local optimization!

Wanted: an optimal policy π* that satisfies

  V^π(s_t) = r_t + γ r_{t+1} + γ² r_{t+2} + ... = Σ_{i=0}^{∞} γ^i r_{t+i}
and
  V^{π*}(s) ≥ V^π(s).

Therefore

  V*(s_t) = max_{a_t, a_{t+1}, a_{t+2}, ...} ( r(s_t, a_t) + γ r(s_{t+1}, a_{t+1}) + γ² r(s_{t+2}, a_{t+2}) + ... ).

Since r(s_t, a_t) depends only on s_t and a_t:

  V*(s_t) = max_{a_t} [ r(s_t, a_t) + γ max_{a_{t+1}, a_{t+2}, ...} ( r(s_{t+1}, a_{t+1}) + γ r(s_{t+2}, a_{t+2}) + ... ) ]    (60)
          = max_{a_t} [ r(s_t, a_t) + γ V*(s_{t+1}) ].    (61)

Bellman equation (fixed-point equation):

  V*(s) = max_a [ r(s, a) + γ V*(δ(s, a)) ].    (62)

Thus

  π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ].    (63)

Iteration instruction (fixed-point iteration):

  V̂(s) = max_a [ r(s, a) + γ V̂(δ(s, a)) ],    (64)

with the initialization V̂(s) = 0 for all s.
  ValueIteration()
    For all s ∈ S
      V̂(s) = 0
    Repeat
      For all s ∈ S
        V̂(s) = max_a [ r(s, a) + γ V̂(δ(s, a)) ]
    Until V̂(s) converges

Theorem: Value iteration converges to V* [SB98]. (A sketch in Python follows below.)
[Figure: value iteration on the walking-robot state space: snapshots of the V̂-values from the all-zero initialization to the fixed point V*.]

Value iteration with γ = 0.9 and two optimal policies.
Attention: choosing an action that leads to the state with the highest V*-value is wrong. Why?
Application to the state s = (2,3) in V*:

  π*(2,3) = argmax_{a ∈ {left, right, up}} [ r(s, a) + γ V*(δ(s, a)) ]
          = argmax { 1 + 0.9 · 2.66,  −1 + 0.9 · 4.05,  0 + 0.9 · 3.28 }
          = argmax { 3.39, 2.65, 2.95 }
          = left
The hardware walking robot

Demo: walking robot
Unknown environment

What to do if the agent has no model of its actions, i.e. δ and r are unknown? Then the update

  V̂(s) = max_a [ r(s, a) + γ V̂(δ(s, a)) ]

is not applicable.

[Figure: the V*-values of the example, as above.]
Today's challenge: Go
- square board with 361 points
- 181 black, 180 white stones
- average branching factor: about 300
- after 4 half-moves: 8 · 10^9 positions
- classical game-tree search methods have no chance!
- pattern recognition on the board!
- Humans are still miles ahead of computer programs.

Breakthrough: a Go program beats a human Go champion
- no combinatorial search (such as minimax search)!
- pattern matching based on deep learning
- deep convolutional neural network (DCNN)
- the DCNN predicts the next k moves
- image processing on the 19 × 19 board
- Monte Carlo tree search (MCTS)
- up to 110,000 rollouts
- wins consistently against commercial Go engines
- Tian, Y. & Zhu, Y. Preprint at http://arxiv.org/pdf/1511.06410.pdf (2015).
- Elizabeth Gibney. "Google AI algorithm masters ancient game of Go". Nature 529, 445-446 (28 January 2016).
Q-learning

Evaluation function Q(s_t, a_t):

  π*(s) = argmax_a Q(s, a).    (65)

Discounting of future rewards and maximization of r_t + γ r_{t+1} + γ² r_{t+2} + ...

Evaluation of action a_t in state s_t:

  Q(s_t, a_t) = max_{a_{t+1}, a_{t+2}, ...} ( r(s_t, a_t) + γ r(s_{t+1}, a_{t+1}) + γ² r(s_{t+2}, a_{t+2}) + ... ).    (66)

Simplification:

  Q(s_t, a_t) = max_{a_{t+1}, a_{t+2}, ...} ( r(s_t, a_t) + γ r(s_{t+1}, a_{t+1}) + γ² r(s_{t+2}, a_{t+2}) + ... )    (67)
              = r(s_t, a_t) + γ max_{a_{t+1}, a_{t+2}, ...} ( r(s_{t+1}, a_{t+1}) + γ r(s_{t+2}, a_{t+2}) + ... )    (68)
              = r(s_t, a_t) + γ max_{a_{t+1}} ( r(s_{t+1}, a_{t+1}) + γ max_{a_{t+2}} ( r(s_{t+2}, a_{t+2}) + ... ) )    (69)
              = r(s_t, a_t) + γ max_{a_{t+1}} Q(s_{t+1}, a_{t+1})    (70)
              = r(s_t, a_t) + γ max_{a_{t+1}} Q(δ(s_t, a_t), a_{t+1})    (71)
              = r(s, a) + γ max_{a'} Q(δ(s, a), a').    (72)

This fixed-point equation is solved iteratively by:

  Q̂(s, a) = r(s, a) + γ max_{a'} Q̂(δ(s, a), a').    (73)
The algorithm for Q-learning:

  Q-Learning()
    For all s ∈ S, a ∈ A
      Q̂(s, a) = 0 (or randomly)
    Repeat
      Select (e.g. randomly) a state s
      Repeat
        Select an action a and carry it out
        Obtain reward r and new state s'
        Q̂(s, a) := r(s, a) + γ max_{a'} Q̂(s', a')
        s := s'
      Until s is a terminal state or time limit reached
    Until Q̂ converges

A sketch in Python follows below.
Application of the algorithm with n_x = 3, n_y = 2 and γ = 0.9:

[Figure: snapshots of the Q̂-values during Q-learning on the 3 × 2 state space, converging towards Q.]
Theorem: Let a deterministic MDP with bounded immediate reward r(s, a) be given, and let Equation (73) with 0 ≤ γ < 1 be used for learning. Let Q̂_n(s, a) be the value of Q̂(s, a) after n updates. If each state-action pair is visited infinitely often, then Q̂_n(s, a) converges to Q(s, a) for all values s and a as n → ∞.

Proof:
- All state-action transitions occur infinitely often.
- Consider intervals in which all state-action transitions occur at least once.
- The maximum error in the Q̂-table is reduced in each such interval by at least the factor γ:

Let

  Δ_n = max_{s,a} |Q̂_n(s, a) − Q(s, a)|

be the maximum error in the table Q̂_n, and let s' = δ(s, a). For all table entries Q̂_n(s, a):

  |Q̂_{n+1}(s, a) − Q(s, a)| = |(r + γ max_{a'} Q̂_n(s', a')) − (r + γ max_{a'} Q(s', a'))|
                            = γ |max_{a'} Q̂_n(s', a') − max_{a'} Q(s', a')|
                            ≤ γ max_{a'} |Q̂_n(s', a') − Q(s', a')|
                            ≤ γ max_{s'', a'} |Q̂_n(s'', a') − Q(s'', a')| = γ Δ_n.

The first inequality holds because for any functions f and g

  |max_x f(x) − max_x g(x)| ≤ max_x |f(x) − g(x)|,

and the second inequality holds because, by additionally varying the state s'', the resulting maximum cannot become smaller. Thus

  Δ_{n+1} ≤ γ Δ_n   and therefore   Δ_k ≤ γ^k Δ_0,

hence lim_{n→∞} Δ_n = 0.
Remarks
- According to the theorem above, Q-learning converges independently of the actions chosen during learning.
- The speed of convergence, however, depends on the actions chosen during learning.
Q-learning in non-deterministic environments

Non-deterministic environment: the response of the environment to action a in state s is non-deterministic:

  Q(s_t, a_t) = E(r(s, a)) + γ Σ_{s'} P(s'|s, a) max_{a'} Q(s', a').    (74)

- The convergence guarantee for Q-learning is lost!
- Reason: the same action a in the same state s can lead to totally different responses from the environment.

New learning rule:

  Q̂_n(s, a) = (1 − α_n) Q̂_{n−1}(s, a) + α_n [ r(s, a) + γ max_{a'} Q̂_{n−1}(δ(s, a), a') ]    (75)

with the time-varying weight factor

  α_n = 1 / (1 + b_n(s, a)),

where
- b_n(s, a) = number of times action a has been chosen in state s before iteration n,
- the term Q̂_{n−1}(s, a) stabilizes the update,
- the values b_n(s, a) must be memorized for all state-action pairs.
TD error and TD learning

With constant α_n = α:

  Q̂_n(s, a) = (1 − α) Q̂_{n−1}(s, a) + α [ r(s, a) + γ max_{a'} Q̂_{n−1}(δ(s, a), a') ]
            = Q̂_{n−1}(s, a) + α [ r(s, a) + γ max_{a'} Q̂_{n−1}(δ(s, a), a') − Q̂_{n−1}(s, a) ]

where the term in brackets is the TD error.

- α = 1: Q-learning
- α = 0: no learning takes place
- 0 < α < 1: ???
Exploration and exploitation

  Q-Learning()
    For all s ∈ S, a ∈ A
      Q̂(s, a) = 0 (or randomly)
    Repeat
      Choose (which?) a start state s
      Repeat
        Choose (which?) an action a and execute it
        Receive reward r and successor state s'
        Q̂(s, a) := r(s, a) + γ max_{a'} Q̂(s', a')
        s := s'
      Until s is a terminal state or time limit reached
    Until Q̂ converges

Possibilities for choosing the next action:
- Random choice: leads to a uniform exploration of all possible actions, but very slow convergence.
- Choosing the best action (highest Q̂-value): optimal exploitation of the learned policy and relatively fast convergence, but non-optimal policies can be learned.
Selection of the start state

[Figure: Q̂-values during learning, illustrating the influence of the selected start states.]
Function approximation, generalization and convergence
- Continuous state variables ⇒ infinite state space.
- A table with V- or Q-values can then not be stored explicitly.

Solution:
- The Q(s, a)-table is replaced by a neural network with input variables s, a and the Q-value as output.
- Finite representation of the (infinite) function Q(s, a)!
- Generalization (from finitely many training samples).

Attention: there is no convergence guarantee any more, because the convergence theorem above only holds if all state-action pairs are visited infinitely often.
Alternative: any other function approximator.
POMDP

POMDP: partially observable Markov decision process.
- Many different states are perceived as one and the same state:
- many states of the real world are mapped to one observation.
- Convergence problems with value iteration and Q-learning.
- Possible solutions: [SB98], observation-based learning [LR02].
Application: TD-Gammon
- TD learning (temporal difference learning) utilizes states that lie further in the future.
- TD-Gammon: a program for playing backgammon.
- TD learning with a backpropagation network with 40 to 80 hidden neurons.
- The only reward: the score at the end of the game.
- TD-Gammon was trained in 1.5 million games against itself.
- It beats backgammon grandmasters!
Other applications
- RoboCup: with reinforcement learning, a policy for the robot is learned, e.g. dribbling [SSK05; Rob].
- inverted pendulum
- control of a quadrocopter

Problems in robotics:
- extreme computation times for high-dimensional problems (many variables/actions)
- feedback from the environment on real robots is very slow
- better, faster learning algorithms are required
Landing of airplanes [Russ Tedrake, IROS 08]

Birds don't solve Navier-Stokes! [Russ Tedrake, IROS 08]
Curse of dimensionality
- Problem: high-dimensional state and action spaces.

Solution methods:
- Learning in nature happens on many abstraction layers.
- In computer science: every learned skill is encapsulated in a module.
- The action space is scaled down.
- States are abstracted.
- Hierarchical learning [BM03].
- Distributed learning (centipede: a brain for each leg).
Curse of dimensionality: other ideas
- The human brain is not a tabula rasa at birth.
- How to get a good initial policy for a robot?
  1. classical programming,
  2. reinforcement learning,
  3. a trainer offers additional feedback;
or:
  1. learning from demonstration (learning with a teacher),
  2. reinforcement learning [Bil+08],
  3. a trainer offers additional feedback.
Current state of research
- fitted value iteration
- connecting reinforcement learning with imitation learning
- policy gradient methods
- actor-critic methods
- natural gradient methods
Fitted value iteration

   1: Randomly sample m states from the MDP
   2: Ψ = 0
   3: n = the number of available actions in A
   4: repeat
   5:   for i = 1 → m do
   6:     for j = 1 → n do
   7:       q(a) = R(s^(i)) + γ V(s^(j))
   8:     end for
   9:     y^(i) = max_j q(a)
  10:   end for
  11:   Ψ = argmin_Ψ Σ_{i=1}^{m} (y^(i) − Ψ^T Φ(s^(i)))²
  12: until Ψ converges
www.teachingbox.org
- Reinforcement learning algorithms:
  - value iteration
  - Q(λ), SARSA(λ)
  - TD(λ)
  - tabular and function-approximation versions
  - actor-critic
  - tile coding
  - locally weighted regression
- Example environments:
  - mountain car
  - gridworld (with editor), windy gridworld
  - dice game
  - n-armed bandit
  - pole swing-up
Literature
- First introduction: T. Mitchell. Machine Learning. McGraw-Hill, 1997. www-2.cs.cmu.edu/~tom/mlbook.html.
- Standard work: R. Sutton and A. Barto. Reinforcement Learning. MIT Press, 1998. www.cs.ualberta.ca/~sutton/book/the-book.html.
- Overview: L.P. Kaelbling, M.L. Littman and A.W. Moore. "Reinforcement Learning: A Survey". In: Journal of Artificial Intelligence Research 4 (1996), pp. 237-285. www-2.cs.cmu.edu/afs/cs/project/jair/pub/volume4/kaelbling96a.pdf.