Introduction to Markov Random Fields and Markov Logic Networks
Mario Tambos - Matr. 8992525
April 4, 2014
The goal of this work is to give an introduction to Markov Random Fields and Markov Logic Networks, with a focus on the latter's usage in practice. First a brief description of First Order Logic will be given. In the second part Markov Random Fields - an undirected kind of Probabilistic Graphical Model in which the Markov Property holds - are introduced and explained in some detail, together with an example of how they work. Then we will see how both previous topics can be combined in the form of Markov Logic Networks. Following this, we will concern ourselves with a detailed, although simple, example in the field of Natural Language Processing. To finalize we will see the advantages and problems of this method, in particular regarding its expressiveness and computational cost.
Contents

1. Introduction

I. First Order Logic
2. Brief Explanation
   2.1. Entities
   2.2. Syntax and Semantics
   2.3. Interpretation and Grounding

II. Markov Random Fields
3. Basic Concepts and Definitions
   3.1. Markov Properties
4. Inference
   4.1. Gibbs Sampling

III. Markov Logic Networks
5. Introduction
6. Probability distribution
   6.1. Inference and Learning
7. Tools
8. Applications
   8.1. On NLP - Statistical Parsing
9. Pros and Cons

IV. Final Words
10. Conclusion
References
1. Introduction

In many fields the need arises to find a succinct representation for known facts that also allows one to discover (or infer) new knowledge. Logical and probabilistic methods are two very popular approaches to solve this problem. In the first case, knowledge is written down as logic formulas, and new formulas (new facts) are derived by using logic equivalences. In the second, facts are seen as random variables, over which statistical/probabilistic reasoning is applied in order to calculate the likelihood of events.

We will start this work by describing First-order Logic, a generalization of propositional logic that falls strictly in the first category. Then we will go on to define Markov Random Fields, a kind of Probabilistic Graphical Model - a probabilistic approach that is also able to express propositional logic formulas. Finally we introduce Markov Logic Networks, a powerful tool that generalizes both methods. For this work previous knowledge in the basics of set theory, formal logic and probability theory is assumed.
Part I. First Order Logic

2. Brief Explanation

Although propositional logic (PL) is a very useful language, it lacks conciseness: in order to state the same fact about three different objects, one needs to write three different formulas, one for each object. So, if we want for instance to say "all natural numbers are greater than zero", we would have to write

(1 > 0) ∧ (2 > 0) ∧ (3 > 0) ∧ · · ·

and so on ad infinitum. First-order logic (FOL) is a formal declarative language that tries to solve this. It builds on propositional logic, and adds several elements in order to increase its expressiveness. In other words, there are things that can be said in first-order logic that cannot be said in propositional logic. Besides its expressiveness, at a conceptual level, what sets FOL apart from PL is their ontological commitments: what each assumes about the nature of reality. According to [20, p. 289]:

... propositional logic assumes that there are facts that either hold or do not hold in the world. Each fact can be in one of two states: true or false [...] First-order logic assumes more; namely, that the world consists of objects with certain relations among them that do or do not hold.

First-order Logic lets us build models - mathematical abstractions that fix the truth or falsehood of every relevant formula in the language ([20, p. 240]). Each model has a domain, that is, a set of objects it contains ([20, p. 290]). It doesn't matter what these objects are; the only restriction is that the domain is nonempty.
2.1. Entities

Note. Based on [17, pp. 109-110] and [20, pp. 288, 289, 292]. FOL defines four types of symbols:

• Constant symbols represent (fixed) objects in the domain, e.g. Alice, Carol, Bob, House, Car.
• Variable symbols stand for any one object in the domain, e.g. x : x ∈ {Alice, Carol, Bob, House, Car}.
• Predicate symbols describe relations among objects (e.g. Friends) or attributes of objects (e.g. Red). A predicate has a truth value - Friends(Bob, Carol) and Red(Car) are true or false.
• Function symbols define mappings from tuples of objects to objects (e.g. MotherOf, which maps Persons (daughters) to Persons (their mothers)). A function has an object value - MotherOf(Alice) is Carol.
Additionally, a formula is a sentence in the language of FOL, constructed with the four previous symbols, and a first-order knowledge base (KB) is a set of formulas in FOL.

2.2. Syntax and Semantics
Note. Based on [17, pp. 109-110] and [20, pp. 294-298]. FOL provides the following rules for the use of its symbols:

• A type is a subset of the objects in the domain. If a variable is typed (it may not be), it ranges only over objects of the corresponding type, and if a constant is typed, it can only represent objects of the corresponding type. For example, a variable x that ranges over people (e.g., Anna, Bob, etc.) cannot be used with Car.
• A term is any expression capable of representing an object in the domain, that is, a constant, a variable, or a function applied to a tuple of terms. For example, x, Bob, and MotherOf(x) are terms.
• An atomic formula or atom is a predicate symbol applied to a tuple of terms (e.g., Red(CarOf(Carol))).
Formulas are recursively constructed from atomic formulas as follows:

Definition 1. FOL formulas (based on [17, pp. 109-110]): A positive literal is an atomic formula; a negative literal is a negated atomic formula. If F1 and F2 are formulas, then 1.-7. are also formulas:

1. ¬F1 (negation): true iff F1 is false;
2. F1 ∧ F2 (conjunction): true iff both F1 and F2 are true;
3. F1 ∨ F2 (disjunction): true iff F1 or F2 is true;
4. F1 =⇒ F2 (implication): true iff F1 is false or F2 is true;
5. F1 ⇐⇒ F2 (equivalence): true iff F1 and F2 have the same truth value;
6. ∀x F1 (universal quantification): true iff F1 is true for every object x in the domain;
7. ∃x F1 (existential quantification): true iff F1 is true for at least one object x in the domain.
Parentheses may be used to enforce precedence. All unquantified variables are assumed to be universally quantified, and the formulas in a KB are implicitly conjoined. Therefore, a KB can be viewed as a single large formula.¹

¹ This means: if a KB consists of the formulas F1, F2, · · · , Fn, then KB ≡ F1 ∧ F2 ∧ · · · ∧ Fn.
2.3. Interpretation and Grounding
Note. Based on [17, pp. 109-110]. In addition to the formulas themselves, we need something to decide whether any given formula is true or false ([20, p. 292]). We do this by means of an interpretation - a set of rules that specifies exactly which objects, relations and functions are referred to by the constant, predicate, and function symbols; that is, a set of rules that gives meaning to the symbols.

Example 1. We could say that:

• Alice refers to the character Alice from Alice in Wonderland, Bob refers to the character SpongeBob from SpongeBob SquarePants, and Helen is Alice's mother.
• Friends refers to the friendship relation defined as: Friends(Alice, Carol) ≡ True; Friends(Alice, Bob) ≡ True; Friends(Carol, Bob) ≡ False.
• MotherOf refers to the motherhood function defined as: MotherOf(Alice) = Helen.
Although it is possible to check facts in a FOL-KB without further refinement of its formulas, for this work we will restrict ourselves to KBs where:

1. All formulas are grounded - in other words, we eliminate all variables from all formulas by replacing them with the appropriate constants.
2. All atoms have a truth value assigned - to do this we recursively assign truth values following the rules in Definition 1.

A Herbrand interpretation is an interpretation that satisfies this last rule. For instance, a Herbrand interpretation of the KB shown in Table 2.1 with constants {Alice, Bob} can be seen in Table 2.2.
English | First-order logic | Clausal form | Weight
Friends of friends are friends | ∀x ∀y ∀z Fr(x, y) ∧ Fr(y, z) =⇒ Fr(x, z) | ¬Fr(x, y) ∨ ¬Fr(y, z) ∨ Fr(x, z) | 0.7
Friendless people smoke | ∀x ¬[∃y Fr(x, y)] =⇒ Sm(x) | Fr(x, g(x)) ∨ Sm(x) | 2.3
Smoking causes cancer | ∀x Sm(x) =⇒ Ca(x) | ¬Sm(x) ∨ Ca(x) | 1.5
If two people are friends, either both smoke or neither does. | ∀x ∀y Fr(x, y) =⇒ (Sm(x) ⇐⇒ Sm(y)) | [¬Fr(x, y) ∨ Sm(x) ∨ ¬Sm(y)] ∧ [¬Fr(x, y) ∨ ¬Sm(x) ∨ Sm(y)] | 1.1

Table 2.1: Example of a first-order knowledge base and MLN. Fr() is short for Friends(), Sm() for Smokes(), and Ca() for Cancer() (from [17, p. 111]).
Atom | Interpretation
Friends(Alice, Alice) | True
Friends(Bob, Bob) | True
Friends(Alice, Bob) | False
Friends(Bob, Alice) | False
Smokes(Alice) | True
Smokes(Bob) | False
Cancer(Alice) | False
Cancer(Bob) | False

Table 2.2: Possible Herbrand interpretation for the KB shown in Table 2.1 with constants {Alice, Bob}.
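As a sketch of how such a check can be mechanized, the snippet below encodes the Herbrand interpretation of Table 2.2 and evaluates every grounding of "smoking causes cancer" over the constants {Alice, Bob} (the helper names are mine, not from the text):

```python
# Herbrand interpretation from Table 2.2: every ground atom has a truth value.
constants = ["Alice", "Bob"]
interp = {
    ("Friends", "Alice", "Alice"): True,  ("Smokes", "Alice"): True,
    ("Friends", "Bob", "Bob"): True,      ("Smokes", "Bob"): False,
    ("Friends", "Alice", "Bob"): False,   ("Cancer", "Alice"): False,
    ("Friends", "Bob", "Alice"): False,   ("Cancer", "Bob"): False,
}

def groundings_smoking_causes_cancer():
    """Ground '∀x Sm(x) =⇒ Ca(x)' and evaluate each grounding."""
    results = {}
    for x in constants:
        sm, ca = interp[("Smokes", x)], interp[("Cancer", x)]
        results[x] = (not sm) or ca      # truth table of the implication
    return results

print(groundings_smoking_causes_cancer())
# Alice smokes but has no cancer, so her grounding is false - the KB is
# unsatisfiable under this interpretation, which motivates Part III.
```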
Part II. Markov Random Fields

Sometimes we have to deal with uncertain environments. We already have the tools to do so by means of statistics and probability theory. Their methods are, however, very verbose for some applications, and not that helpful when trying to get an overview of the interaction between different variables. Probabilistic graphical models try to solve this by encoding the variables and their relationships in a graph.
3. Basic Concepts and Definitions

The easiest way of thinking about MRFs is as an undirected graph with weights associated to its nodes, edges or some combination thereof. According to [11, p. 6], the MRF's nodes represent variables, and its edges some notion of direct probabilistic interaction between neighboring variables. For instance, if one were to represent the probabilistic interactions of smoking habits, cancer and friendship among two persons, the graph would be something like Figure 3.1. In this graph each node represents an assertion over the person(s) and each edge represents a relationship among assertions, e.g. the relationship between their smoking habits or between smoking and cancer; the ellipse-marked cliques group the relationships we want to consider as the basic blocks of our calculations (more on this later). As we can see, each clique Ci has a weight function πi and a function fi associated with it; the meaning of fi will be explained later. The value of πi depends on the nodes' states. The πi's are also called factors, and we define them as:

Definition 2. Factor (taken from [11, p. 7]): Let X = {X1, X2, . . . , Xn} be a set of random variables. We define a factor to be a function from Val(X) → R+, with Val(X) = Im(X1) × Im(X2) × · · · × Im(Xn).

It is important to realize that these weights are not probabilities. Probabilities are calculated as a product of the weights involved (in the evidence), normalized over all the possibilities. This way of calculating probabilities over a graph is called a clique factorization, and we define it formally as:

Definition 3. Clique Factorization (based on [11, pp. 7-8] and [17, pp. 108-109]): Let G be an undirected graph. A distribution PG factorizes over G if it is associated with

• a set of cliques C1, . . . , Ck of G;
• factors π1[C1], . . . , πk[Ck]
(a): The graph without weights. (b): The cliques C1 and C2 with their associated π1 and π2. (c): The cliques C3 and C4 with their associated π3 and π4. (d): The cliques C5 and C6 with their associated π5 and π6.

Figure 3.1: A simple Markov network describing the probabilistic interactions of smoking habits, cancer and friendship among two persons. Sm is short for Smokes, Ca is short for Cancer and Fr is short for Friends.

such that

PG(X = x) = (1/Z) P'G(X = x)    (3.1)

where

X = x ≡ {X1 = x1, X2 = x2, . . . , Xn = xn}, ∀i, Xi ∈ X ∧ xi ∈ x ∧ xi ∈ Val(Xi)

is an assignment to all the random variables in X,

P'G(X = x) = ∏_{i=1}^{k} πi[Ci]

is an unnormalized measure, and

Z = Σ_{x ∈ Val(X)} P'G(X = x)

is a normalizing constant called the partition function.
Figure 3.2: A sub-network of the one in Figure 3.1.
Example 2. If we consider a sub-graph, shown in Figure 3.2, of the graph in Figure 3.1, with the πi defined as seen in the figure's tables, then we can calculate the probability of the state {Ca(A) = 0, Ca(B) = 0, Sm(A) = 0, Sm(B) = 0, Fr(A, B) = 1} as follows:

Figure 3.3: A sub-network of the one in Figure 3.1.

P'(Ca(A) = 0, Ca(B) = 0, Sm(A) = 0, Sm(B) = 0, Fr(A, B) = 1)
  = π3[C3] × π5[C5] × π6[C6]
  = π3[Fr(A, B) = 1, Sm(A) = 0, Sm(B) = 0] × π5[Sm(A) = 0, Ca(A) = 0] × π6[Sm(B) = 0, Ca(B) = 0]
  = 1.1 × 1.5 × 1.5 = 2.475

Z = Σ_{x ∈ Val(X)} P'(X = x)
  = P'(Ca(A) = 0, Ca(B) = 0, Sm(A) = 0, Sm(B) = 0, Fr(A, B) = 0)
  + P'(Ca(A) = 0, Ca(B) = 0, Sm(A) = 0, Sm(B) = 0, Fr(A, B) = 1)
  + P'(Ca(A) = 0, Ca(B) = 0, Sm(A) = 0, Sm(B) = 1, Fr(A, B) = 0)
  + · · ·
  + P'(Ca(A) = 1, Ca(B) = 1, Sm(A) = 1, Sm(B) = 1, Fr(A, B) = 1)
  = 133.65

P(Ca(A) = 0, Ca(B) = 0, Sm(A) = 0, Sm(B) = 0, Fr(A, B) = 1)
  = (1/Z) P'(Ca(A) = 0, Ca(B) = 0, Sm(A) = 0, Sm(B) = 0, Fr(A, B) = 1)
  = 2.475 / 133.65 ≈ 0.0185
where X = {Ca(A), Ca(B), Sm(A), Sm(B), Fr(A, B)}.
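The unnormalized measure above can be checked mechanically. Since the factor tables of Figure 3.2 are not reproduced in this text, the factors below are reconstructed from the clause weights of Table 2.1 (value 1 when a clause is unsatisfied) - an assumption, so the partition function computed this way need not agree with the figure's tables, but the unnormalized product does match the 2.475 derived above:

```python
from itertools import product

# Factors of the Figure 3.2 sub-network, reconstructed from the clause
# weights in Table 2.1 (assumed value 1 for unsatisfied assignments).
def pi3(fr, sm_a, sm_b):   # clique C3: Fr(A,B), Sm(A), Sm(B)
    return 1.1 if (not fr or sm_a == sm_b) else 1.0

def pi_cancer(sm, ca):     # cliques C5/C6: Sm(x), Ca(x)
    return 1.5 if (not sm or ca) else 1.0

def unnormalized(ca_a, ca_b, sm_a, sm_b, fr):
    """P'(X = x): product of the clique factors."""
    return pi3(fr, sm_a, sm_b) * pi_cancer(sm_a, ca_a) * pi_cancer(sm_b, ca_b)

p_unnorm = unnormalized(0, 0, 0, 0, 1)
print(p_unnorm)            # 1.1 * 1.5 * 1.5 = 2.475

# Partition function Z: sum of P' over all 2^5 joint states.
Z = sum(unnormalized(*state) for state in product([0, 1], repeat=5))
```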
Sometimes working with multiplications is not very desirable, especially if computers are doing the work, since numbers can grow very large or very small quite quickly. Because of this people frequently use sums of logarithms instead, which are numerically more stable.

Definition 5. Logistic Model (based on [11, pp. 7-8] and [17, pp. 108-109]): Let wi = ln(πi[Ci]) and let fi be some binary function (a feature) over the state of the clique i; then:

P'G(X = x) = exp( Σi wi fi(x) )    (3.2)
Actually we are not restricted to only binary fi's, but we will focus exclusively on them for the rest of this work. Using the concept of clique factorization we now define a Markov Random Field formally as:

Definition 6. Markov Random Field: Let H be an undirected graph; we call H a Markov Random Field iff there is a distribution PH that factorizes over H.
3.1. Markov Properties

Something that may come to our attention is that if we knew the state of the variable Smokes(A), then the state of Cancer(A) would be independent of the states of Smokes(B) and Friends(A, B). This is known as Conditional Independence - the fact that two variables are independent of each other only if a set of variables is known. We formalize this as follows:

Definition 7. Conditional Independence (taken from [11, p. 2]): X is conditionally independent from Y given Z, written as (X ⊥ Y | Z), iff

P(X = x, Y = y | Z = z) = P(X = x | Z = z) P(Y = y | Z = z) : ∀x ∈ Val(X), y ∈ Val(Y), z ∈ Val(Z)

We see this in Figure 3.4: we observe Smokes(A) = 0, and that observation modifies the factors π3 and π5 in a way that makes it impossible for further information about Cancer(A) to affect either Smokes(B) or Friends(A, B), and vice versa. In MRFs conditional independence comes in two flavors: the Local and the Global Markov Properties.
Figure 3.4: The same Markov network of Figure 3.2, conditioned on Sm(A) = 0. As we can see, the factors π3 and π5 no longer depend on Sm(A).

The Local Markov Property states exactly what our insight above said: that any given node is conditionally independent of all other nodes in the MRF, given the node's neighbors. In MRF-lingo those neighbors are called the node's Markov blanket (blanket for short). Formally:
Definition 8. Local Markov Property (taken from [11, p. 9]): Let H be an undirected graph. Then for each node X ∈ X, the Markov blanket of X, denoted NH(X), is the set of neighbors of X in the graph (those that share an edge with X). We define the local Markov independencies associated with H to be

I(H) = {(X ⊥ X − {X} − NH(X) | NH(X)) : X ∈ X}

The Global Markov Property has a similar idea, but regarding the totality of the MRF's nodes. We will not treat it in this work, but you can read its details in [11, pp. 9-10].
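The Markov blanket is straightforward to compute from an edge list. Below the edges of the Figure 3.2 sub-network are read off its cliques - an assumption about the figure, which is not reproduced in this text:

```python
# Edges of the Figure 3.2 sub-network, read off the cliques
# C3 = {Fr(A,B), Sm(A), Sm(B)}, C5 = {Sm(A), Ca(A)}, C6 = {Sm(B), Ca(B)}.
edges = [
    ("Fr(A,B)", "Sm(A)"), ("Fr(A,B)", "Sm(B)"), ("Sm(A)", "Sm(B)"),
    ("Sm(A)", "Ca(A)"), ("Sm(B)", "Ca(B)"),
]

def markov_blanket(node):
    """N_H(X): all nodes sharing an edge with X."""
    blanket = set()
    for u, v in edges:
        if u == node:
            blanket.add(v)
        if v == node:
            blanket.add(u)
    return blanket

print(markov_blanket("Sm(B)"))
```

Note that the blanket of Sm(B) computed here, {Fr(A,B), Sm(A), Ca(B)}, is exactly the conditioning set used later in Example 3.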
4. Inference

Note. Based on [11, pp. 10-11]. With the word inference one means in this context the act of querying the MRF. That is, given a subset of variables E ⊆ X and an instantiation e of those variables - an assignment of a state to each variable, our evidence - we would like to know P(X = x | E = e) for some other subset of variables X ⊆ X, i.e. the probability of some instantiation x of X given e.

We will focus on one type of such queries, the Maximum A Posteriori (MAP), in which we want to calculate not only the probability of an instantiation, but the instantiation for which that probability is maximal as well; formally: argmax_x P(X = x | E = e). Sadly this kind of problem is very costly to compute exactly (Theorems 2.15-16 in [11]). However, we can save a lot of computation time if we content ourselves with an approximation, at least in most cases.
4.1. Gibbs Sampling

Note. Based on [11, pp. 24-29] and [17, pp. 116-118]. Gibbs Sampling is one way of approximating the MAP. In it we simply try to estimate the value of Xi given the value of its blanket, not using the value of Xi in previous states. Its procedure is quite simple: for each i = 1, . . . , n, randomly sample Xi, conditioned on Xi's blanket; and keep resampling until convergence. The first steps of the method can be seen applied to our example of Figure 3.2 in Figure 4.1.

Definition 9. Sampling Probability: Using a logistic model, and taking into account that we are considering binary variables, the probability with which each xi is sampled can be calculated based only on the Markov blanket of Xi as

P(Xi = xi | Bi = bi) = exp( Σ_{fj ∈ Fi} wj fj(Xi = xi, Bi = bi) ) / [ exp( Σ_{fj ∈ Fi} wj fj(Xi = 0, Bi = bi) ) + exp( Σ_{fj ∈ Fi} wj fj(Xi = 1, Bi = bi) ) ]    (4.1)

where Bi is the Markov blanket of Xi, bi is a complete instantiation of Bi's variables, wj and fj are as defined in Definition 5, and Fi is the set of features Xi appears in.
Example 3. For instance, if we define the unnormalized measure

P'(Sm(B) = x | Sm(A) = 0, Fr(A, B) = 1, Ca(B) = 1)
  = exp( ln(π3[C3]) f3(Sm(B) = x, Sm(A) = 0, Fr(A, B) = 1) + ln(π6[C6]) f6(Sm(B) = x, Ca(B) = 1) )
  = exp( ln(1.1) f3(Sm(B) = x, Sm(A) = 0, Fr(A, B) = 1) + ln(1.5) f6(Sm(B) = x, Ca(B) = 1) )

P'(Sm(B) = 1 | Sm(A) = 0, Fr(A, B) = 1, Ca(B) = 1)
  = exp( ln(1.1) f3(Sm(B) = 1, Sm(A) = 0, Fr(A, B) = 1) + ln(1.5) f6(Sm(B) = 1, Ca(B) = 1) )
  = exp( ln(1.5) ) = 1.5

P'(Sm(B) = 0 | Sm(A) = 0, Fr(A, B) = 1, Ca(B) = 1)
  = exp( ln(1.1) f3(Sm(B) = 0, Sm(A) = 0, Fr(A, B) = 1) + ln(1.5) f6(Sm(B) = 0, Ca(B) = 1) )
  = exp( ln(1.1) + ln(1.5) ) = 1.65

then we can compute the first step of Figure 4.1 as follows:

P(Sm(B) = 0 | Sm(A) = 0, Fr(A, B) = 1, Ca(B) = 1)
  = P'(Sm(B) = 0 | Sm(A) = 0, Fr(A, B) = 1, Ca(B) = 1) / [ P'(Sm(B) = 0 | Sm(A) = 0, Fr(A, B) = 1, Ca(B) = 1) + P'(Sm(B) = 1 | Sm(A) = 0, Fr(A, B) = 1, Ca(B) = 1) ]
  = 1.65 / (1.5 + 1.65) ≈ 0.524
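The sampling probability of Example 3 can be reproduced directly from Equation 4.1. The sketch below hard-codes the two features that mention Sm(B); the function names are mine, not from the text:

```python
from math import exp, log

w3, w6 = log(1.1), log(1.5)   # weights of the features f3 and f6

def f3(sm_b, sm_a, fr):       # clause: ¬Fr(A,B) ∨ (Sm(A) ⇐⇒ Sm(B))
    return 1 if (not fr or sm_a == sm_b) else 0

def f6(sm_b, ca_b):           # clause: ¬Sm(B) ∨ Ca(B)
    return 1 if (not sm_b or ca_b) else 0

def p_sm_b(x, sm_a=0, fr=1, ca_b=1):
    """Equation 4.1 for X_i = Sm(B), given an instantiation of its blanket."""
    num = exp(w3 * f3(x, sm_a, fr) + w6 * f6(x, ca_b))
    den = sum(exp(w3 * f3(v, sm_a, fr) + w6 * f6(v, ca_b)) for v in (0, 1))
    return num / den

print(round(p_sm_b(0), 3))    # ≈ 0.524, as in Example 3
```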
(a): Evidence node in gray. (b): First step - sample Sm(B) conditioned on a random initialization of Ca(B) and Fr(A, B) and the evidence Sm(A) = 0. (c): Second step - sample Fr(A, B) conditioned on the sampled Sm(B) = 0 and the evidence Sm(A) = 0. (d): Third step - sample Ca(A) conditioned on the evidence Sm(A) = 0. (e): Fourth step - sample Ca(B) conditioned on the sampled Sm(B) = 0. (f): Fifth step - re-sample Sm(B) conditioned on the sampled Ca(B) = 0 and Fr(A, B) = 0 and the evidence Sm(A) = 0.

Figure 4.1: First steps of a Gibbs sampling for our example of Figure 3.2.
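The whole resampling loop can be put together as a minimal sketch of Gibbs sampling for the Figure 3.2 sub-network. The sweep order, iteration count and random initialization are arbitrary choices of this sketch, not prescribed by the text, and the factors are the same reconstructed ones assumed earlier:

```python
import random
from math import exp, log

# Weighted features of the sub-network (state keys: Fr, SmA, SmB, CaA, CaB).
features = [
    (log(1.1), lambda s: not s["Fr"] or s["SmA"] == s["SmB"]),
    (log(1.5), lambda s: not s["SmA"] or s["CaA"]),
    (log(1.5), lambda s: not s["SmB"] or s["CaB"]),
]

def gibbs(evidence, steps=1000, seed=0):
    """Estimate marginals P(V = 1 | evidence) by Gibbs sampling."""
    rng = random.Random(seed)
    state = {v: rng.randint(0, 1) for v in ("Fr", "SmA", "SmB", "CaA", "CaB")}
    state.update(evidence)
    counts = {v: 0 for v in state}
    for _ in range(steps):
        for v in state:
            if v in evidence:            # evidence variables stay clamped
                continue
            weights = []
            for val in (0, 1):           # Equation 4.1: score both values
                state[v] = val
                weights.append(exp(sum(w for w, f in features if f(state))))
            state[v] = 1 if rng.random() < weights[1] / sum(weights) else 0
        for v in state:                  # accumulate after each full sweep
            counts[v] += state[v]
    return {v: counts[v] / steps for v in counts}

marginals = gibbs({"SmA": 0})
```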
Part III. Markov Logic Networks

5. Introduction

Although the formulas in Table 2.1 might not always be true in the real world, they are normally true; that means our FOL-KB will always be non-satisfiable. In the vast majority of cases, it is very difficult to come up with non-trivial and satisfiable KBs, and those KBs only reflect a part of the domain knowledge. This problem could be solved if we thought about KBs not as true/false absolutes, but more like soft assertions - if some rules in a KB are broken, the KB doesn't automatically become unsatisfiable, it just becomes less likely. This kind of reasoning has been used to link propositional logic and probability in the form of probabilistic graphical models. Now we would like to extend the same concept to first-order logic as well. Intuitively, we introduce Markov Logic Networks as a hybrid between first-order logic and Markov random fields. Our objective is to define a method of describing relationships among objects, as in first-order logic, but making those descriptions probabilistic. We would also like to have a way of expressing that we are more sure about some descriptions than about others in our KB, so that they affect the likelihood of our KB more markedly. We define Markov Logic Networks formally as follows:
Definition 10. (Taken from [17, p. 111]) A Markov logic network (MLN) L is a set of pairs (Fi, wi), where Fi is a formula in first-order logic and wi is a real number. Together with a finite set of constants C = {c1, c2, ..., c|C|}, it defines a Markov network ML,C (Equation 3.1 and Equation 3.2) as follows:

1. ML,C contains one binary node for each possible grounding of each predicate appearing in L. The value of the node is 1 if the ground atom is true, and 0 otherwise.
2. ML,C contains one feature for each possible grounding of each formula Fi ∈ L. The value of this feature is 1 if the ground formula is true, and 0 otherwise. The weight of the feature is the wi associated with Fi in L.

We will name each ML,C a ground Markov network.
We can create an MLN from a first-order KB by assigning a weight to each formula. In the case shown in Table 2.1, the table's last two columns form an MLN. The graphical structure of ML,C, on the other hand, follows from Definition 10: there is an edge between two nodes of ML,C iff the corresponding ground atoms appear together in at least one grounding of one formula in L. Thus, the atoms in each ground formula form a (not necessarily maximal) clique in ML,C. The syntax of the formulas in an MLN is the same described in Part I. As [17, p. 111] puts it:
An MLN can be viewed as a template for constructing Markov networks. Given different sets of constants, it will produce different networks, and these may be of widely varying size, but all will have certain regularities in structure and parameters, given by the MLN (e.g., all groundings of the same formula will have the same weight).
Example 4. In the case of the KB displayed in Table 2.1, using only the two last formulas and with constants Alice and Bob (A and B for short), we can see the process of constructing ML,C in Figure 5.1.

(a): Ground predicates. (b): If two people are friends, either both smoke or neither does. (c): Smoking causes cancer. (d): Cliques that have associated weights.

Figure 5.1: Ground Markov network construction for the last two formulas in Table 2.1, with constants Alice (A) and Bob (B).

Starting from the formulas ¬Sm(x) ∨ Ca(x) and (¬Fr(x, y) ∨ Sm(x) ∨ ¬Sm(y)) ∧ (¬Fr(x, y) ∨ ¬Sm(x) ∨ Sm(y)), the process can be taken apart as follows:

1. Identify ground predicates and add them as nodes. In this case all the groundings of the Friends predicate are Fr(A, A), Fr(B, B), Fr(A, B), Fr(B, A).
2. Add an edge for each pair of grounded predicates in ¬Sm(x) ∨ Ca(x).
3. Add an edge for each pair of grounded predicates in (¬Fr(x, y) ∨ Sm(x) ∨ ¬Sm(y)) ∧ (¬Fr(x, y) ∨ ¬Sm(x) ∨ Sm(y)).
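The three construction steps are easy to mechanize. The listing below enumerates the ground atoms and the (not necessarily maximal) cliques for the constants A and B, following Definition 10; the helper names are mine:

```python
from itertools import product

constants = ["A", "B"]

# Step 1: one binary node per grounding of each predicate.
nodes  = [("Sm", x) for x in constants]
nodes += [("Ca", x) for x in constants]
nodes += [("Fr", x, y) for x, y in product(constants, repeat=2)]

# Step 2: one ground formula (feature/clique) per grounding of ¬Sm(x) ∨ Ca(x).
cancer_cliques = [{("Sm", x), ("Ca", x)} for x in constants]

# Step 3: one per grounding of the friends formula, over ordered pairs (x, y).
friends_cliques = [{("Fr", x, y), ("Sm", x), ("Sm", y)}
                   for x, y in product(constants, repeat=2)]

print(len(nodes), len(cancer_cliques), len(friends_cliques))   # 8 2 4
```

Repeating the run with constants = ["A", "B", "C"] reproduces the larger network of Figure 5.2: the template stays the same, only the groundings multiply.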
In Figure 5.2 we build ML,C another time for the same two formulas as in Figure 5.1, but this time using the additional constant Carol (C for short).

(a): Ground predicates. (b): If two people are friends, either both smoke or neither does. (c): Smoking causes cancer.

Figure 5.2: Another ground Markov network construction for the last two formulas in Table 2.1, this time with constants Alice (A), Bob (B) and Carol (C).
Once built, the ML,C can be used to infer the probability that Alice smokes given that she's friends with Bob, that Carol has cancer given that Alice smokes, etc. We do this by first grounding the KB and constructing its (Herbrand) interpretation². What we obtain by this process is what we call a possible world.

² In the sense of Subsection 2.3.
6. Probability distribution

From Definition 10, Equation 3.1 and Equation 3.2, the probability distribution over a possible world x specified by the ground Markov network ML,C is given by

P(X = x) = (1/Z) exp( Σi wi ni(x) )    (6.1)

where ni(x) is the number of true groundings of Fi in x and wi = ln(πi).
Now that we know how to calculate the probability distribution based on each formula's weights, we can make this assertion: a world W where n people with no friends don't smoke is e^(2.3n) times less likely than a world where all friendless people smoke:

P(W | N = n) / P(W | N = 0) = exp(2.3(|W| − n)) / exp(2.3|W|) = exp(−2.3n) = 1 / exp(2.3n)
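This ratio can be verified numerically with Equation 6.1: only the count of true groundings of the friendless-people-smoke formula differs between the two worlds, so every other term of the exponent cancels. The count of ten friendless people below is an arbitrary choice for illustration:

```python
from math import exp, isclose

w = 2.3          # weight of "friendless people smoke" (Table 2.1)

def unnorm(n_true):
    """The only non-cancelling term of Equation 6.1: exp(w * n_i(x))."""
    return exp(w * n_true)

total = 10       # assumed number of friendless people
for n in range(total + 1):
    # A world with n friendless non-smokers has total - n true groundings.
    ratio = unnorm(total - n) / unnorm(total)
    assert isclose(ratio, exp(-w * n))   # e^(2.3n) times less likely
```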
Notice that, if we were to use first-order logic, any world where even a single friendless person is a non-smoker would be impossible.

For the rest of this work we will make the three following assumptions. As [17, p. 112] explains, these assumptions:

[...] ensure that the set of possible worlds for (L, C) is finite, and that ML,C represents a unique, well-defined probability distribution over those worlds, irrespective of the interpretation and domain. These assumptions are quite reasonable in most practical applications, and greatly simplify the use of MLNs.
Assumption 1. Unique names (taken from [17, p. 112]). Different constants refer to different objects ([10]).

Assumption 2. Domain closure (taken from [17, p. 112]). The only objects in the domain are those representable using the constant and function symbols in (L, C) ([10]).

Assumption 3. Known functions (taken from [17, p. 112]). For each function appearing in L, the value of that function applied to every possible tuple of arguments is known, and is an element of C.
Using this last assumption, we can replace functions by their values when grounding formulas. When assigning weights to the formulas, if a formula contains more than one clause, its weight is divided equally among the clauses, and a clause's weight is assigned to each of its groundings. For an algorithm that constructs all groundings please read [17, p. 113]. If we think a little about it, we can see that Markov Logic Networks can encode any Markov Random Field (or Bayesian Network) and any first-order logic KB (by setting all weights to ∞). This is shown in Proposition 4.2 and Proposition 4.3 of [17].
6.1. Inference and Learning

Since each ground Markov network ML,C is a Markov random field, we can make inference in it as we explained in Section 4. In this context inference answers queries of the form P(F1 | F2, L, C), that is: what is the probability of a formula F1, given that we know the truth value of formula F2, using the MLN L and the set of constants C?

One common step before making queries over an MLN is learning - the process of adapting either the structure and/or the weights of an MLN to reflect some set of data, referred to as training data, with the ultimate goal of being able to make accurate inference. Although we will not go deeper into either topic, we will offer some tools for and examples of their usage in Section 7 and Section 8. For details and formalism please refer to [17, pp. 116-120].
7. Tools

Note. Based on [7] and [19]. Alchemy is a software tool that lets us define Markov Logic Networks and perform learning and inference tasks on them. It provides three shell programs, learnstruct, learnwts and infer, that let us perform structure learning, weight learning and inference, respectively. Each command's arguments can be seen in Table 7.1 and Table 7.2.

-i | Input .mln files
-o | Output .mln file
-t | Training .db files

Table 7.1: learnstruct and learnwts arguments.

Ground atoms are defined in .db (database) files. Ground atoms preceded by ! are false, by ? are unknown, and by neither are true. A .db file consists of a set of ground atoms, one per line.
-i | Input .mln files
-r | Output file containing inference results
-e | Evidence .db files
-q | Query atoms (comma-separated with no space)

Table 7.2: infer arguments.
A .mln file has a declaration part and a formula part. The declaration section must contain at least one predicate, while the formulas section contains zero or more formulas. You can express an arbitrary first-order formula in an .mln file. The syntax for logical connectors is shown in Table 7.3.

Symbol | Meaning | Precedence
! | Not. | 1
^ | And. | 2
v | Or. | 3
=> | Implies. | 4
<=> | If and only if. | 5
FORALL, forall, Forall | Universal quantification. | 6
EXIST, exist, Exist | Existential quantification. | 6
. | Hard formula. | NA
//, /* */ | Comment. | NA
@, $ | Reserved character. | NA
funcVar, isReturnValueOf | Reserved words. Neither variables nor predicates should start with them. | NA

Table 7.3: Logical operators and special characters/words. A lower number in the precedence column indicates a higher precedence.
Free variables are interpreted as universally quantified at the outermost level. Quantifiers can be applied to more than one variable at once (e.g., forall x,y). A formula in an .mln file can be preceded by a number representing the weight of the formula. A formula can also be terminated by a period (.), indicating that it is a hard formula³. However, a formula cannot have both a weight and a period. A legal identifier is a sequence of alphanumeric characters plus the characters - (hyphen), _ (underscore), and ' (prime); ' cannot be the first character. Variables in formulas must begin with a lowercase letter, and constants must begin with an uppercase one. Constants may also be expressed as strings (e.g., Alice and "A Course in Logic" are both acceptable as constants). Alchemy replaces existentially quantified subformulas by disjunctions of all their groundings; therefore, existential quantifiers (or negated universal quantifiers) should be used with care. Types and constants can be declared in an .mln file with the following syntax: <typename> = {<constant1>, <constant2>, ...}, e.g., person = {Alice, Bob}. One can also declare integer types, e.g., ageOfStudent = {18, ..., 22}. Each declared type must contain at least one constant.

³ That is, a formula whose weight is ∞.

We can, for example, write the KB of Table 2.1 in an example.mln file:

0.7 (!Friends(x,y) v !Friends(y,z) v Friends(x,z))
2.3 ((EXIST y Friends(x,y)) v Smokes(x))
1.5 (!Smokes(x) v Cancer(x))
1.1 (!Friends(x,y) v Smokes(x) v !Smokes(y)) ^ (!Friends(x,y) v !Smokes(x) v Smokes(y))

corresponding to Equation 7.1-Equation 7.4:
¬F r(x, y) ∨ ¬F r(y, z) ∨ F r(x, z)
(7.1)
∀x ¬[∃y F r(x, y)] =⇒ Sm(x)
(7.2)
¬Sm(x) ∨ Ca(x)
(7.3)
[¬F r(x, y) ∨ Sm(x) ∨ ¬Sm(y)] ∧[¬F r(x, y) ∨ ¬Sm(x) ∨ Sm(y)] and then write .db le:
Friends(Alice, Bob) Friends(Alice, Carol) Smokes(Carol) !Cancer(Bob) We can then make inference with this les by executing:
infer -i example.mln -e evidence.db -r result.db -q Cancer,Smokes and as a result we would obtain:
22
(7.4)
Cancer(Alice)  0.80397
Cancer(Carol)  0.79797
Smokes(Alice)  0.99895
Smokes(Bob)    0.954955

These are the probabilities that Alice and Carol have cancer, and that Alice and Bob are smokers.
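The result file pairs each query atom with its estimated probability, one atom per line, as shown above. A minimal Python sketch (not part of Alchemy; the only format assumption is the two-column layout above) for reading such output and filtering by probability:

```python
def read_results(lines, threshold=0.5):
    """Parse lines like 'Cancer(Alice) 0.80397' into a dict,
    keeping only atoms whose probability meets the threshold."""
    probs = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            atom, p = parts
            probs[atom] = float(p)
    return {atom: p for atom, p in probs.items() if p >= threshold}

results = read_results([
    "Cancer(Alice) 0.80397",
    "Cancer(Carol) 0.79797",
    "Smokes(Alice) 0.99895",
    "Smokes(Bob) 0.954955",
], threshold=0.9)
print(results)  # only the two Smokes atoms exceed 0.9
```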
8. Applications

The applications of MLNs range from Natural Language Processing (NLP, [4, 8, 21]) to Computer Vision ([1, 6, 9, 12, 22]), Network Analysis ([5]), and Bioinformatics ([3]). In the following section we will concern ourselves with one particular application in the field of natural language processing.
8.1. On NLP - Statistical Parsing
Note. Based on [7, pp. 13-15].

Statistical parsing is a family of parsing methods that extends rule-based parsing by adding a weight to each rule. These weights are normally learned from a corpus of text. In this subsection we would like to show how to perform statistical parsing over a small subset of English using MLNs. We will define the parsing rules as a context-free grammar in Chomsky normal form, in which we are restricted to rules of the form:

A → B C
A → a

where a is a terminal symbol and A, B, and C are non-terminals.
We now define a very simple grammar for parsing (also simple) sentences in English:

S  → NP VP
NP → Adj N
NP → Det N
VP → V NP

This grammar is composed of sentences (S), noun and verb phrases (NP and VP), adjectives (Adj), nouns (N), determiners (Det), and verbs (V). We can now write these rules in FOL as follows:
NP(i,j) ^ VP(j,k) => S(i,k)   // S -> NP VP
Adj(i,j) ^ N(j,k) => NP(i,k)  // NP -> Adj N
Det(i,j) ^ N(j,k) => NP(i,k)  // NP -> Det N
V(i,j) ^ NP(j,k) => VP(i,k)   // VP -> V NP
NP(i,k) ^ Det(i,j) => !Adj(i,j)

where i and j are indices between words, plus one at the beginning and one at the end of a sentence; e.g., the words in the sentence "Mary had a little lamb" would be indexed as 0 Mary 1 had 2 a 3 little 4 lamb 5. The last rule serves the purpose of solving the conflict between the second and third rules. Informally, it says that if a noun phrase results in a determiner and a noun, it cannot result in an adjective and a noun. We also need a collection of words, and rules indicating what those words are:
// Determiners
Token("a",i) => Det(i,j)
Token("the",i) => Det(i,j)
// Adjectives
Token("big",i) => Adj(i,j)
Token("small",i) => Adj(i,j)
// Nouns
Token("dogs",i) => N(i,j)
Token("dog",i) => N(i,j)
Token("cats",i) => N(i,j)
Token("cat",i) => N(i,j)
Token("fly",i) => N(i,j)
Token("flies",i) => N(i,j)
// Verbs
Token("chase",i) => V(i,j)
Token("chases",i) => V(i,j)
Token("eat",i) => V(i,j)
Token("eats",i) => V(i,j)
Token("fly",i) => V(i,j)
Token("flies",i) => V(i,j)

The only remaining problem is that some words can fall into more than one category (in our case only fly and flies). We must then make sure that such words get assigned only one category. One way to do this is to write mutual exclusion rules for each part-of-speech pair:
!Det(i,j) v !Adj(i,j)  // a word can be only one of {det., adj.}
!Det(i,j) v !N(i,j)    // a word can be only one of {det., noun}
!Det(i,j) v !V(i,j)    // a word can be only one of {det., verb}
!Adj(i,j) v !N(i,j)    // a word can be only one of {adj., noun}
!Adj(i,j) v !V(i,j)    // a word can be only one of {adj., verb}
!N(i,j) v !V(i,j)      // a word can be only one of {noun, verb}
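The six clauses above are exactly the 2-combinations of the four part-of-speech categories, so for larger tag sets they can be generated rather than written by hand. A small Python sketch (illustrative only; the category names are ours, and the output is plain text, not fed to Alchemy directly):

```python
from itertools import combinations

def mutual_exclusion_rules(categories):
    """Emit one Alchemy-style exclusion clause per unordered pair of categories."""
    return ["!%s(i,j) v !%s(i,j)" % (a, b)
            for a, b in combinations(categories, 2)]

# 4 categories yield C(4,2) = 6 clauses, matching the listing above
for rule in mutual_exclusion_rules(["Det", "Adj", "N", "V"]):
    print(rule)
```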
We can now use Alchemy to train our MLN on some database, and use the result to perform inference. For instance, if we were to input the sentence
// Sentence: Small dogs chase big flies
Token("small",0)
Token("dogs",1)
Token("chase",2)
Token("big",3)
Token("flies",4)
S(0,5)

we would expect the inference engine to output something like
Adj(0,1)  // small
N(1,2)    // dogs
V(2,3)    // chase
Adj(3,4)  // big
N(4,5)    // flies
NP(0,2)   // small dogs
NP(3,5)   // big flies
VP(2,5)   // chase big flies
For the sake of brevity we have skipped some details of the rules' definitions; in consequence this is not a working example. One can, however, get this and other more complete working examples at [15].
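For comparison, the span predicates used above correspond one-to-one to the chart entries of a CYK-style recognizer. The following plain-Python sketch (independent of Alchemy; grammar and lexicon transcribed from the rules above) derives the same facts deterministically:

```python
# Lexicon and CNF rules transcribed from the Token(...) and grammar rules above.
LEXICON = {
    "a": {"Det"}, "the": {"Det"},
    "big": {"Adj"}, "small": {"Adj"},
    "dog": {"N"}, "dogs": {"N"}, "cat": {"N"}, "cats": {"N"},
    "fly": {"N", "V"}, "flies": {"N", "V"},
    "chase": {"V"}, "chases": {"V"}, "eat": {"V"}, "eats": {"V"},
}
RULES = [("S", "NP", "VP"), ("NP", "Adj", "N"),
         ("NP", "Det", "N"), ("VP", "V", "NP")]

def parse_spans(words):
    """Return the set of (Category, i, j) span facts derivable for the sentence."""
    chart = set()
    for i, w in enumerate(words):                 # terminal spans (i, i+1)
        for cat in LEXICON.get(w, ()):
            chart.add((cat, i, i + 1))
    for length in range(2, len(words) + 1):       # wider spans, bottom-up
        for i in range(len(words) - length + 1):
            k = i + length
            for j in range(i + 1, k):
                for head, left, right in RULES:
                    if (left, i, j) in chart and (right, j, k) in chart:
                        chart.add((head, i, k))
    return chart

spans = parse_spans("small dogs chase big flies".split())
print(("S", 0, 5) in spans)  # the sentence parses: prints True
```

Note that for an ambiguous word such as flies the chart simply keeps both N(4,5) and V(4,5); the MLN's weighted mutual exclusion rules are what turn this hard ambiguity into a probabilistic choice.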
9. Pros and Cons

Experiments performed in [17, Sections 7, 8] show the promise of MLNs when compared to purely logical or purely probabilistic approaches; this should not be cause for surprise, since MLNs are able to generalize both. MLNs are also quite easy to work with from the user's perspective, given that they introduce no new concepts, but rather use old and seemingly simple concepts in a novel fashion. Furthermore, the wide range of research fields that use them is proof of their flexibility and relevance in a great number of very different tasks.

Nevertheless, one of, if not the major drawback of this method is its time performance. Witness to this is the abundance of work on speeding up inference ([18, 16, 2]) and learning ([13, 14]). The inference problem is rooted in how the ground Markov network is created, as it can grow very quickly even for small domains, while the learning problem is caused by the difficulties of working with first-order logic.
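The grounding blow-up can be made concrete with a back-of-the-envelope count (a sketch under the usual assumption that every variable ranges over one shared domain of constants): a clause with v distinct variables has |C|^v groundings over a domain of |C| constants.

```python
def num_groundings(num_constants, num_vars):
    """Number of ground clauses for a formula with `num_vars` distinct
    variables over a single domain of `num_constants` constants."""
    return num_constants ** num_vars

# The transitivity clause !Fr(x,y) v !Fr(y,z) v Fr(x,z) has 3 variables,
# so the ground network grows cubically in the number of constants:
for n in (10, 100, 1000):
    print(n, num_groundings(n, 3))  # 1000 constants yield 10**9 ground clauses
```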
Part IV.
Final Words

10. Conclusion

Markov logic networks are a hybrid approach that combines first-order logic with Markov random fields. They do so by simply attaching a weight to each first-order formula and then using the weighted first-order knowledge base as a template for constructing Markov random fields. An easy way to work with them is by using the Alchemy package; with its help we can define a knowledge base and then perform weight and structure learning, as well as inference over the resulting model.

Markov logic networks seem to be a very active research topic. A search for the term in Google Scholar (http://scholar.google.com) results in a couple thousand links on a wide range of applications of, and extensions to, the original scheme. Although they require some background in very different and traditionally separated areas (logic and probability theory), once grasped, their concepts are very intuitive to apply. Furthermore, given the widespread use of their underlying methods, and their capacity to generalize them, they provide an easy way of enhancing said methods' applications.
References

[1] Baxi, S. S., Dabhade, S., and Varma, P. D. A survey paper on methods of detecting human under partial occlusion. International Journal of Engineering 2, 4 (2013).

[2] Beedkar, K., Del Corro, L., and Gemulla, R. Fully parallel inference in Markov logic networks. In BTW (2013), pp. 205-224.

[3] Biba, M., Ferilli, S., and Esposito, F. Protein fold recognition using Markov logic networks. In Mathematical Approaches to Polymer Sequence Analysis and Related Problems, R. Bruni, Ed. Springer New York, 2011, pp. 69-85.

[4] Chan, K., and Lam, W. Pronoun resolution with Markov logic networks. In Information Retrieval Technology. Springer, 2008, pp. 153-164.

[5] Chen, H., Ku, W.-S., Wang, H., Tang, L., and Sun, M.-T. Linkprobe: Probabilistic inference on large-scale social networks.

[6] Choi, C., Choi, J., Lee, E., You, I., and Kim, P. Probabilistic spatio-temporal inference for motion event understanding. Neurocomputing 122 (2013), 24-32.

[7] Domingos, M. S. P. The Alchemy Tutorial, 2010.

[8] Fernández, D., Marinai, S., Lladós, J., and Fornés, A. Contextual word spotting in historical manuscripts using Markov logic networks. In Proceedings of the 2nd International Workshop on Historical Document Imaging and Processing (2013), ACM, pp. 36-43.

[9] Geier, T., Biundo, S., Reuter, S., and Dietmayer, K. Track-person association using a first-order probabilistic model. In Tools with Artificial Intelligence (ICTAI), 2012 IEEE 24th International Conference on (2012), vol. 1, IEEE, pp. 844-851.

[10] Genesereth, M. R., and Nilsson, N. J. Logical Foundations of Artificial Intelligence, vol. 9. Morgan Kaufmann, Los Altos, 1987.

[11] Koller, D., Friedman, N., Getoor, L., and Taskar, B. Graphical models in a nutshell. In Statistical Relational Learning (2007).

[12] Liu, Z., and von Wichert, G. A generalizable knowledge framework for semantic indoor mapping based on Markov logic networks and data driven MCMC. Future Generation Computer Systems (2013).

[13] Lowd, D., and Domingos, P. Efficient weight learning for Markov logic networks. In Knowledge Discovery in Databases: PKDD 2007. Springer, 2007, pp. 200-211.

[14] Mihalkova, L., and Mooney, R. J. Bottom-up learning of Markov logic network structure. In Proceedings of the 24th International Conference on Machine Learning (2007), ACM, pp. 625-632.

[15] Alchemy: Open source AI. http://alchemy.cs.washington.edu/.

[16] Niu, F., Ré, C., Doan, A., and Shavlik, J. Tuffy: Scaling up statistical inference in Markov logic networks using an RDBMS. Proceedings of the VLDB Endowment 4, 6 (2011), 373-384.

[17] Richardson, M., and Domingos, P. Markov logic networks. Machine Learning 62, 1-2 (2006), 107-136.

[18] Shavlik, J. W., and Natarajan, S. Speeding up inference in Markov logic networks by preprocessing to reduce the size of the resulting grounded network. In IJCAI (2009), vol. 9, pp. 1951-1956.

[19] Kok, S., Singla, P., et al. The Alchemy system for statistical relational AI: User manual, January 2010.

[20] Russell, S., and Norvig, P. Artificial Intelligence: A Modern Approach, 3rd ed. Prentice Hall, 2009.

[21] Wu, Z., Yu, Z., Guo, J., Mao, C., and Zhang, Y. Fusion of long distance dependency features for Chinese named entity recognition based on Markov logic networks. In Natural Language Processing and Chinese Computing, M. Zhou, G. Zhou, D. Zhao, Q. Liu, and L. Zou, Eds., vol. 333 of Communications in Computer and Information Science. Springer Berlin Heidelberg, 2012, pp. 132-142.

[22] Xu, M., and Petrou, M. Learning logic rules for scene interpretation based on Markov logic networks. In Computer Vision - ACCV 2009, H. Zha, R.-i. Taniguchi, and S. Maybank, Eds., vol. 5996 of Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2010, pp. 341-350.