Introduction to Markov Random Fields and Markov Logic Networks
Mario Tambos - Matr. 8992525
April 4, 2014
The goal of this work is to give an introduction to Markov Random Fields and Markov Logic Networks, with a focus on the latter's usage in practice. First a brief description of First Order Logic will be given. In the second part Markov Random Fields - an undirected kind of Probabilistic Graphical Model in which the Markov Property holds - are introduced and explained in some detail, together with an example of how they work. Then we will see how both previous topics can be combined in the form of Markov Logic Networks. Following this, we will concern ourselves with a detailed, although simple, example in the field of Natural Language Processing. To finalize we will see the advantages and problems of this method, in particular regarding its expressiveness and computational cost.
Contents

1. Introduction

I. First Order Logic
2. Brief Explanation
   2.1. Entities
   2.2. Syntax and Semantics
   2.3. Interpretation and Grounding

II. Markov Random Fields
3. Basic Concepts and Definitions
   3.1. Markov Properties
4. Inference
   4.1. Gibbs Sampling

III. Markov Logic Networks
5. Introduction
6. Probability distribution
   6.1. Inference and Learning
7. Tools
8. Applications
   8.1. On NLP - Statistical Parsing
9. Pros and Cons

IV. Final Words
10. Conclusion
References
1. Introduction

In many fields the need arises to find a succinct representation for known facts that also allows one to discover (or infer) new knowledge. Logical and probabilistic methods are two very popular approaches to solve this problem. In the first case, knowledge is written down as logic formulas, and new formulas (new facts) are derived by using logic equivalences. In the second, facts are seen as random variables, over which statistical/probabilistic reasoning is applied in order to calculate the likelihood of events.

We will start this work by describing First-order Logic, a generalization of propositional logic that falls strictly in the first category. Then we will go on to define Markov Random Fields, a kind of Probabilistic Graphical Model - a probabilistic approach that is also able to express propositional logic formulas. Finally we introduce Markov Logic Networks, a powerful tool that generalizes both methods. For this work previous knowledge in the basics of set theory, formal logic and probability theory is assumed.
Part I. First Order Logic

2. Brief Explanation

Although propositional logic (PL) is a very useful language, it lacks conciseness: in order to state the same fact about three different objects, one needs to write three different formulas, one for each object. So, if we want for instance to say "all natural numbers are greater than zero", we would have to write

(1 > 0) ∧ (2 > 0) ∧ (3 > 0) ∧ · · ·

and so on ad infinitum. First-order logic (FOL) is a formal declarative language that tries to solve this. It builds on propositional logic, and adds several elements in order to increase its expressiveness. In other words, there are things that can be said in first-order logic that cannot be said in propositional logic. Besides its expressiveness, at a conceptual level, what sets FOL apart from PL is their ontological commitments: what each assumes about the nature of reality. According to [20, p. 289]:

... propositional logic assumes that there are facts that either hold or do not hold in the world. Each fact can be in one of two states: true or false [...] First-order logic assumes more; namely, that the world consists of objects with certain relations among them that do or do not hold.

First-order Logic lets us build models - mathematical abstractions that fix the truth or falsehood of every relevant formula in the language ([20, p. 240]). Each model has a domain, that is, a set of objects it contains ([20, p. 290]). It doesn't matter what these objects are; the only restriction is that the domain is nonempty.
2.1. Entities

Note. Based on [17, pp. 109-110] and [20, pp. 288, 289, 292]. FOL defines four types of symbols:

• Constant symbols represent (fixed) objects in the domain, e.g. Alice, Carol, Bob, House, Car.
• Variable symbols stand for any one object in the domain, e.g. x : x ∈ {Alice, Carol, Bob, House, Car}.
• Predicate symbols describe relations among objects (e.g. Friends) or attributes of objects (e.g. Red). A predicate has a truth value - Friends(Bob, Carol) and Red(Car) are true or false.
• Function symbols define mappings from tuples of objects to objects (e.g. MotherOf, which maps Persons (daughters) to Persons (their mothers)). A function has an object value - MotherOf(Alice) is Carol.
Additionally, a formula is a sentence in the language of FOL, constructed with the four previous symbols, and a first-order knowledge base (KB) is a set of formulas in FOL.

2.2. Syntax and Semantics
Note. Based on [17, pp. 109-110] and [20, pp. 294-298]. FOL provides the following rules for the use of its symbols:

• A type is a subset of the objects in the domain. If a variable is typed (it may not be), it ranges only over objects of the corresponding type, and if a constant is typed, it can only represent objects of the corresponding type. For example, a variable x that ranges over people (e.g., Anna, Bob, etc.) cannot be used with Car.
• A term is any expression capable of representing an object in the domain, that is, a constant, a variable, or a function applied to a tuple of terms. For example, x, Bob, and MotherOf(x) are terms.
• An atomic formula or atom is a predicate symbol applied to a tuple of terms (e.g., Red(CarOf(Carol))).
Formulas are recursively constructed from atomic formulas as follows:

Definition 1. FOL formulas (based on [17, pp. 109-110]): A positive literal is an atomic formula; a negative literal is a negated atomic formula. If F1 and F2 are formulas, then 1.-7. are also formulas:

1. ¬F1 (negation): true iff F1 is false;
2. F1 ∧ F2 (conjunction): true iff both F1 and F2 are true;
3. F1 ∨ F2 (disjunction): true iff F1 or F2 is true;
4. F1 =⇒ F2 (implication): true iff F1 is false or F2 is true;
5. F1 ⇐⇒ F2 (equivalence): true iff F1 and F2 have the same truth value;
6. ∀x F1 (universal quantification): true iff F1 is true for every object x in the domain;
7. ∃x F1 (existential quantification): true iff F1 is true for at least one object x in the domain.
Parentheses may be used to enforce precedence. All unquantified variables are assumed to be universally quantified, and the formulas in a KB are implicitly conjoined. Therefore, a KB can be viewed as a single large formula.¹

¹ This means: if a KB consists of the formulas F1, F2, · · · , Fn, then KB ≡ F1 ∧ F2 ∧ · · · ∧ Fn.
2.3. Interpretation and Grounding
Note. Based on [17, pp. 109-110]. In addition to the formulas themselves, we need something to decide whether any given formula is true or false ([20, p. 292]). We do this by means of an interpretation - a set of rules that specifies exactly which objects, relations and functions are referred to by the constant, predicate, and function symbols; that is, a set of rules that gives meaning to the symbols.

Example 1. We could say that:

• Alice refers to the character Alice from Alice in Wonderland, Bob refers to the character SpongeBob from SpongeBob SquarePants, and Helen is Alice's mother.
• Friends refers to the friendship relation defined as: Friends(Alice, Carol) ≡ True; Friends(Alice, Bob) ≡ True; Friends(Carol, Bob) ≡ False.
• MotherOf refers to the motherhood function defined as: MotherOf(Alice) = Helen.
Although it is possible to check facts in a FOL-KB without further refinement of its formulas, for this work we will restrict ourselves to KBs where:

1. All formulas are grounded - in other words, we eliminate all variables from all formulas by replacing them with the appropriate constants.
2. All atoms have a truth value assigned - to do this we recursively assign truth values following the rules in Definition 1.

A Herbrand interpretation is an interpretation that satisfies this last rule. For instance, a Herbrand interpretation of the KB shown in Table 2.1 with constants {Alice, Bob} can be seen in Table 2.2.
English | First-order logic | Clausal form | Weight
Friends of friends are friends | ∀x ∀y ∀z Fr(x, y) ∧ Fr(y, z) =⇒ Fr(x, z) | ¬Fr(x, y) ∨ ¬Fr(y, z) ∨ Fr(x, z) | 0.7
Friendless people smoke | ∀x ¬[∃y Fr(x, y)] =⇒ Sm(x) | Fr(x, g(x)) ∨ Sm(x) | 2.3
Smoking causes cancer | ∀x Sm(x) =⇒ Ca(x) | ¬Sm(x) ∨ Ca(x) | 1.5
If two people are friends, either both smoke or neither does. | ∀x ∀y Fr(x, y) =⇒ (Sm(x) ⇐⇒ Sm(y)) | [¬Fr(x, y) ∨ Sm(x) ∨ ¬Sm(y)] ∧ [¬Fr(x, y) ∨ ¬Sm(x) ∨ Sm(y)] | 1.1

Table 2.1: Example of a first-order knowledge base and MLN. Fr() is short for Friends(), Sm() for Smokes(), and Ca() for Cancer() (from [17, p. 111]).
Atom | Interpretation
Friends(Alice, Alice) | True
Friends(Bob, Bob) | True
Friends(Alice, Bob) | False
Friends(Bob, Alice) | False
Smokes(Alice) | True
Smokes(Bob) | False
Cancer(Alice) | False
Cancer(Bob) | False

Table 2.2: Possible Herbrand interpretation for the KB shown in Table 2.1 with constants {Alice, Bob}.
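As a sketch of how such a check can be mechanized, the snippet below encodes the Herbrand interpretation of Table 2.2 and evaluates every grounding of "smoking causes cancer" over the constants {Alice, Bob} (the helper names are mine, not from the text):

```python
# Herbrand interpretation from Table 2.2: every ground atom has a truth value.
constants = ["Alice", "Bob"]
interp = {
    ("Friends", "Alice", "Alice"): True,  ("Smokes", "Alice"): True,
    ("Friends", "Bob", "Bob"): True,      ("Smokes", "Bob"): False,
    ("Friends", "Alice", "Bob"): False,   ("Cancer", "Alice"): False,
    ("Friends", "Bob", "Alice"): False,   ("Cancer", "Bob"): False,
}

def groundings_smoking_causes_cancer():
    """Ground '∀x Sm(x) =⇒ Ca(x)' and evaluate each grounding."""
    results = {}
    for x in constants:
        sm, ca = interp[("Smokes", x)], interp[("Cancer", x)]
        results[x] = (not sm) or ca      # truth table of the implication
    return results

print(groundings_smoking_causes_cancer())
# Alice smokes but has no cancer, so her grounding is false - the KB is
# unsatisfiable under this interpretation, which motivates Part III.
```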
Part II. Markov Random Fields

Sometimes we have to deal with uncertain environments. We already have the tools to do so by means of statistics and probability theory. Their methods are, however, very verbose for some applications, and not that helpful when trying to get an overview of the interaction between different variables. Probabilistic graphical models try to solve this by encoding the variables and their relationships in a graph.
3. Basic Concepts and Definitions

The easiest way of thinking about MRFs is as an undirected graph with weights associated to its nodes, edges or some combination thereof. According to [11, p. 6], the MRF's nodes represent variables, and its edges some notion of direct probabilistic interaction between neighboring variables. For instance, if one were to represent the probabilistic interactions of smoking habits, cancer and friendship among two persons, the graph would be something like Figure 3.1. In this graph each node represents an assertion over the person(s) and each edge represents a relationship among assertions, e.g. the relationship between their smoking habits or between smoking and cancer; the ellipse-marked cliques group the relationships we want to consider as the basic blocks of our calculations (more on this later). As we can see, each clique Ci has a weight function πi and a function fi associated with it; the meaning of fi will be explained later. The value of πi depends on the nodes' states. The πi's are also called factors, and we define them as:

Definition 2. Factor (taken from [11, p. 7]): Let X = {X1, X2, . . . , Xn} be a set of random variables. We define a factor to be a function from Val(X) → R+, with Val(X) = Im(X1) × Im(X2) × · · · × Im(Xn).

It is important to realize that these weights are not probabilities. Probabilities are calculated as a product of the weights involved (in the evidence), normalized over all the possibilities. This way of calculating probabilities over a graph is called a clique factorization, and we define it formally as:

Definition 3. Clique Factorization (based on [11, pp. 7-8] and [17, pp. 108-109]): Let G be an undirected graph. A distribution PG factorizes over G if it is associated with

• a set of cliques C1, . . . , Ck of G;
• factors π1[C1], . . . , πk[Ck]
(a): The graph without weights. (b): The cliques C1 and C2 with their associated π1 and π2. (c): The cliques C3 and C4 with their associated π3 and π4. (d): The cliques C5 and C6 with their associated π5 and π6.

Figure 3.1: A simple Markov network describing the probabilistic interactions of smoking habits, cancer and friendship among two persons. Sm is short for Smokes, Ca is short for Cancer and Fr is short for Friends.

such that

PG(X = x) = (1/Z) P'G(X = x)    (3.1)

where

X = x ≡ {X1 = x1, X2 = x2, . . . , Xn = xn}, ∀i, Xi ∈ X ∧ xi ∈ x ∧ xi ∈ Val(Xi)

is an assignment to all the random variables in X,

P'G(X = x) = ∏_{i=1}^{k} πi[Ci]

is an unnormalized measure, and

Z = Σ_{x ∈ Val(X)} P'G(X = x)

is a normalizing constant called the partition function.
Figure 3.2: A sub-network of the one in Figure 3.1.
Example 2. If we consider a sub-graph, shown in Figure 3.2, of the graph in Figure 3.1, with the πi defined as seen in the figure's tables, then we can calculate the probability of the state {Ca(A) = 0, Ca(B) = 0, Sm(A) = 0, Sm(B) = 0, Fr(A, B) = 1} as follows:

Figure 3.3: A sub-network of the one in Figure 3.1.

P'(Ca(A) = 0, Ca(B) = 0, Sm(A) = 0, Sm(B) = 0, Fr(A, B) = 1)
  = π3[C3] × π5[C5] × π6[C6]
  = π3[Fr(A, B) = 1, Sm(A) = 0, Sm(B) = 0] × π5[Sm(A) = 0, Ca(A) = 0] × π6[Sm(B) = 0, Ca(B) = 0]
  = 1.1 × 1.5 × 1.5 = 2.475

Z = Σ_{x ∈ Val(X)} P'(X = x)
  = P'(Ca(A) = 0, Ca(B) = 0, Sm(A) = 0, Sm(B) = 0, Fr(A, B) = 0)
  + P'(Ca(A) = 0, Ca(B) = 0, Sm(A) = 0, Sm(B) = 0, Fr(A, B) = 1)
  + P'(Ca(A) = 0, Ca(B) = 0, Sm(A) = 0, Sm(B) = 1, Fr(A, B) = 0)
  + · · ·
  + P'(Ca(A) = 1, Ca(B) = 1, Sm(A) = 1, Sm(B) = 1, Fr(A, B) = 1)
  = 133.65

P(Ca(A) = 0, Ca(B) = 0, Sm(A) = 0, Sm(B) = 0, Fr(A, B) = 1)
  = (1/Z) P'(Ca(A) = 0, Ca(B) = 0, Sm(A) = 0, Sm(B) = 0, Fr(A, B) = 1)
  = 2.475 / 133.65 ≈ 0.0185
where X = {Ca(A), Ca(B), Sm(A), Sm(B), Fr(A, B)}.
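The unnormalized measure above can be checked mechanically. Since the factor tables of Figure 3.2 are not reproduced in this text, the factors below are reconstructed from the clause weights of Table 2.1 (value 1 when a clause is unsatisfied) - an assumption, so the partition function computed this way need not agree with the figure's tables, but the unnormalized product does match the 2.475 derived above:

```python
from itertools import product

# Factors of the Figure 3.2 sub-network, reconstructed from the clause
# weights in Table 2.1 (assumed value 1 for unsatisfied assignments).
def pi3(fr, sm_a, sm_b):   # clique C3: Fr(A,B), Sm(A), Sm(B)
    return 1.1 if (not fr or sm_a == sm_b) else 1.0

def pi_cancer(sm, ca):     # cliques C5/C6: Sm(x), Ca(x)
    return 1.5 if (not sm or ca) else 1.0

def unnormalized(ca_a, ca_b, sm_a, sm_b, fr):
    """P'(X = x): product of the clique factors."""
    return pi3(fr, sm_a, sm_b) * pi_cancer(sm_a, ca_a) * pi_cancer(sm_b, ca_b)

p_unnorm = unnormalized(0, 0, 0, 0, 1)
print(p_unnorm)            # 1.1 * 1.5 * 1.5 = 2.475

# Partition function Z: sum of P' over all 2^5 joint states.
Z = sum(unnormalized(*state) for state in product([0, 1], repeat=5))
```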
Sometimes working with multiplications is not very desirable, especially if computers are doing the work, since numbers can grow very large or very small quite quickly. Because of this people frequently use sums of logarithms instead, which are numerically more stable.

Definition 5. Logistic Model (based on [11, pp. 7-8] and [17, pp. 108-109]): Let wi = ln(πi[Ci]) and let fi be some binary function (a feature) over the state of the clique i; then:

P'G(X = x) = exp( Σi wi fi(x) )    (3.2)
Actually we are not restricted to only binary fi's, but we will focus exclusively on them for the rest of this work. Using the concept of clique factorization we now define a Markov Random Field formally as:

Definition 6. Markov Random Field: Let H be an undirected graph; we call H a Markov Random Field iff there is a distribution PH that factorizes over H.
3.1. Markov Properties

Something that may come to our attention is that if we knew the state of the variable Smokes(A), then the state of Cancer(A) would be independent of the states of Smokes(B) and Friends(A, B). This is known as Conditional Independence - the fact that two variables are independent of each other only if a set of variables is known. We formalize this as follows:

Definition 7. Conditional Independence (taken from [11, p. 2]): X is conditionally independent from Y given Z, written as (X ⊥ Y | Z), iff

P(X = x, Y = y | Z = z) = P(X = x | Z = z) P(Y = y | Z = z) : ∀x ∈ Val(X), y ∈ Val(Y), z ∈ Val(Z)

We see this in Figure 3.4: we observe Smokes(A) = 0, and that observation modifies the factors π3 and π5 in a way that makes it impossible for further information about Cancer(A) to affect either Smokes(B) or Friends(A, B), and vice versa. In MRFs conditional independence comes in two flavors: the Local and the Global Markov Properties.
Figure 3.4: The same Markov network of Figure 3.2, conditioned on Sm(A) = 0. As we can see, the factors π3 and π5 no longer depend on Sm(A).

The Local Markov Property states exactly what our insight above said: that any given node is conditionally independent of all other nodes in the MRF, given the node's neighbors. In MRF-lingo those neighbors are called the node's Markov blanket (blanket for short). Formally:
Definition 8. Local Markov Property (taken from [11, p. 9]): Let H be an undirected graph. Then for each node X ∈ X, the Markov blanket of X, denoted NH(X), is the set of neighbors of X in the graph (those that share an edge with X). We define the local Markov independencies associated with H to be

I(H) = {(X ⊥ X − {X} − NH(X) | NH(X)) : X ∈ X}

The Global Markov Property has a similar idea, but regarding the totality of the MRF's nodes. We will not treat it in this work, but you can read its details in [11, pp. 9-10].
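The Markov blanket is straightforward to compute from an edge list. Below the edges of the Figure 3.2 sub-network are read off its cliques - an assumption about the figure, which is not reproduced in this text:

```python
# Edges of the Figure 3.2 sub-network, read off the cliques
# C3 = {Fr(A,B), Sm(A), Sm(B)}, C5 = {Sm(A), Ca(A)}, C6 = {Sm(B), Ca(B)}.
edges = [
    ("Fr(A,B)", "Sm(A)"), ("Fr(A,B)", "Sm(B)"), ("Sm(A)", "Sm(B)"),
    ("Sm(A)", "Ca(A)"), ("Sm(B)", "Ca(B)"),
]

def markov_blanket(node):
    """N_H(X): all nodes sharing an edge with X."""
    blanket = set()
    for u, v in edges:
        if u == node:
            blanket.add(v)
        if v == node:
            blanket.add(u)
    return blanket

print(markov_blanket("Sm(B)"))
```

Note that the blanket of Sm(B) computed here, {Fr(A,B), Sm(A), Ca(B)}, is exactly the conditioning set used later in Example 3.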
4. Inference

Note. Based on [11, pp. 10-11]. With the word inference one means in this context the act of querying the MRF. That is, given a subset of variables E ⊆ X and an instantiation e of those variables - an assignment of a state to each variable, our evidence - we would like to know P(X = x | E = e) for some other subset of variables X ⊆ X, i.e. the probability of some instantiation x of X given e.

We will focus on one type of such queries, the Maximum A Posteriori (MAP), in which we want to calculate not only the probability of an instantiation, but the instantiation for which that probability is maximal as well; formally: argmax_x P(X = x | E = e). Sadly this kind of problem is very costly to compute exactly (Theorems 2.15-16 in [11]). However, we can save a lot of computation time if we content ourselves with an approximation, at least in most cases.
4.1. Gibbs Sampling

Note. Based on [11, pp. 24-29] and [17, pp. 116-118]. Gibbs Sampling is one way of approximating the MAP. In it we simply try to estimate the value of Xi given the value of its blanket, not using the value of Xi in previous states. Its procedure is quite simple: for each i = 1, . . . , n, randomly sample Xi, conditioned on Xi's blanket; and keep resampling until convergence. The first steps of the method can be seen applied to our example of Figure 3.2 in Figure 4.1.

Definition 9. Sampling Probability: Using a logistic model, and taking into account that we are considering binary variables, the probability with which each xi is sampled can be calculated based only on the Markov blanket of Xi as

P(Xi = xi | Bi = bi) = exp( Σ_{fj ∈ Fi} wj fj(Xi = xi, Bi = bi) ) / [ exp( Σ_{fj ∈ Fi} wj fj(Xi = 0, Bi = bi) ) + exp( Σ_{fj ∈ Fi} wj fj(Xi = 1, Bi = bi) ) ]    (4.1)

where Bi is the Markov blanket of Xi, bi is a complete instantiation of Bi's variables, wj and fj are as defined in Definition 5, and Fi is the set of features Xi appears in.
Example 3. For instance, if we define the unnormalized measure

P'(Sm(B) = x | Sm(A) = 0, Fr(A, B) = 1, Ca(B) = 1)
  = exp( ln(π3[C3]) f3(Sm(B) = x, Sm(A) = 0, Fr(A, B) = 1) + ln(π6[C6]) f6(Sm(B) = x, Ca(B) = 1) )
  = exp( ln(1.1) f3(Sm(B) = x, Sm(A) = 0, Fr(A, B) = 1) + ln(1.5) f6(Sm(B) = x, Ca(B) = 1) )

P'(Sm(B) = 1 | Sm(A) = 0, Fr(A, B) = 1, Ca(B) = 1)
  = exp( ln(1.1) f3(Sm(B) = 1, Sm(A) = 0, Fr(A, B) = 1) + ln(1.5) f6(Sm(B) = 1, Ca(B) = 1) )
  = exp( ln(1.5) ) = 1.5

P'(Sm(B) = 0 | Sm(A) = 0, Fr(A, B) = 1, Ca(B) = 1)
  = exp( ln(1.1) f3(Sm(B) = 0, Sm(A) = 0, Fr(A, B) = 1) + ln(1.5) f6(Sm(B) = 0, Ca(B) = 1) )
  = exp( ln(1.1) + ln(1.5) ) = 1.65

then we can compute the first step of Figure 4.1 as follows:

P(Sm(B) = 0 | Sm(A) = 0, Fr(A, B) = 1, Ca(B) = 1)
  = P'(Sm(B) = 0 | Sm(A) = 0, Fr(A, B) = 1, Ca(B) = 1) / [ P'(Sm(B) = 0 | Sm(A) = 0, Fr(A, B) = 1, Ca(B) = 1) + P'(Sm(B) = 1 | Sm(A) = 0, Fr(A, B) = 1, Ca(B) = 1) ]
  = 1.65 / (1.5 + 1.65) ≈ 0.524
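The sampling probability of Example 3 can be reproduced directly from Equation 4.1. The sketch below hard-codes the two features that mention Sm(B); the function names are mine, not from the text:

```python
from math import exp, log

w3, w6 = log(1.1), log(1.5)   # weights of the features f3 and f6

def f3(sm_b, sm_a, fr):       # clause: ¬Fr(A,B) ∨ (Sm(A) ⇐⇒ Sm(B))
    return 1 if (not fr or sm_a == sm_b) else 0

def f6(sm_b, ca_b):           # clause: ¬Sm(B) ∨ Ca(B)
    return 1 if (not sm_b or ca_b) else 0

def p_sm_b(x, sm_a=0, fr=1, ca_b=1):
    """Equation 4.1 for X_i = Sm(B), given an instantiation of its blanket."""
    num = exp(w3 * f3(x, sm_a, fr) + w6 * f6(x, ca_b))
    den = sum(exp(w3 * f3(v, sm_a, fr) + w6 * f6(v, ca_b)) for v in (0, 1))
    return num / den

print(round(p_sm_b(0), 3))    # ≈ 0.524, as in Example 3
```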
(a): Evidence node in gray. (b): First step - sample Sm(B) conditioned on a random initialization of Ca(B) and Fr(A, B) and the evidence Sm(A) = 0. (c): Second step - sample Fr(A, B) conditioned on the sampled Sm(B) = 0 and the evidence Sm(A) = 0. (d): Third step - sample Ca(A) conditioned on the evidence Sm(A) = 0. (e): Fourth step - sample Ca(B) conditioned on the sampled Sm(B) = 0. (f): Fifth step - re-sample Sm(B) conditioned on the sampled Ca(B) = 0 and Fr(A, B) = 0 and the evidence Sm(A) = 0.

Figure 4.1: First steps of a Gibbs sampling for our example of Figure 3.2.
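The whole resampling loop can be put together as a minimal sketch of Gibbs sampling for the Figure 3.2 sub-network. The sweep order, iteration count and random initialization are arbitrary choices of this sketch, not prescribed by the text, and the factors are the same reconstructed ones assumed earlier:

```python
import random
from math import exp, log

# Weighted features of the sub-network (state keys: Fr, SmA, SmB, CaA, CaB).
features = [
    (log(1.1), lambda s: not s["Fr"] or s["SmA"] == s["SmB"]),
    (log(1.5), lambda s: not s["SmA"] or s["CaA"]),
    (log(1.5), lambda s: not s["SmB"] or s["CaB"]),
]

def gibbs(evidence, steps=1000, seed=0):
    """Estimate marginals P(V = 1 | evidence) by Gibbs sampling."""
    rng = random.Random(seed)
    state = {v: rng.randint(0, 1) for v in ("Fr", "SmA", "SmB", "CaA", "CaB")}
    state.update(evidence)
    counts = {v: 0 for v in state}
    for _ in range(steps):
        for v in state:
            if v in evidence:            # evidence variables stay clamped
                continue
            weights = []
            for val in (0, 1):           # Equation 4.1: score both values
                state[v] = val
                weights.append(exp(sum(w for w, f in features if f(state))))
            state[v] = 1 if rng.random() < weights[1] / sum(weights) else 0
        for v in state:                  # accumulate after each full sweep
            counts[v] += state[v]
    return {v: counts[v] / steps for v in counts}

marginals = gibbs({"SmA": 0})
```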
Part III. Markov Logic Networks

5. Introduction

Although the formulas in Table 2.1 might not always be true in the real world, they are normally true; that means our FOL-KB will always be non-satisfiable. In the vast majority of cases, it is very difficult to come up with non-trivial and satisfiable KBs, and those KBs only reflect a part of the domain knowledge. This problem could be solved if we thought about KBs not as true/false absolutes, but more like soft assertions - if some rules in a KB are broken, the KB doesn't automatically become unsatisfiable, it just becomes less likely. This kind of reasoning has been used to link propositional logic and probability in the form of probabilistic graphical models. Now we would like to extend the same concept to first-order logic as well. Intuitively, we introduce Markov Logic Networks as a hybrid between first-order logic and Markov random fields. Our objective is to define a method of describing relationships among objects, as in first-order logic, but making those descriptions probabilistic. We would also like to have a way of expressing that we are more sure about some descriptions than about others in our KB, so that they affect the likelihood of our KB more markedly. We define Markov Logic Networks formally as follows:
Definition 10. (Taken from [17, p. 111]) A Markov logic network (MLN) L is a set of pairs (Fi, wi), where Fi is a formula in first-order logic and wi is a real number. Together with a finite set of constants C = {c1, c2, ..., c|C|}, it defines a Markov network ML,C (Equation 3.1 and Equation 3.2) as follows:

1. ML,C contains one binary node for each possible grounding of each predicate appearing in L. The value of the node is 1 if the ground atom is true, and 0 otherwise.
2. ML,C contains one feature for each possible grounding of each formula Fi ∈ L. The value of this feature is 1 if the ground formula is true, and 0 otherwise. The weight of the feature is the wi associated with Fi in L.

We will name each ML,C a ground Markov network.
We can create an MLN from a first-order KB by assigning a weight to each formula. In the case shown in Table 2.1, the table's last two columns form an MLN. The graphical structure of ML,C, on the other hand, follows from Definition 10: there is an edge between two nodes of ML,C iff the corresponding ground atoms appear together in at least one grounding of one formula in L. Thus, the atoms in each ground formula form a (not necessarily maximal) clique in ML,C. The syntax of the formulas in an MLN is the same described in Part I. As [17, p. 111] puts it:
An MLN can be viewed as a template for constructing Markov networks. Given different sets of constants, it will produce different networks, and these may be of widely varying size, but all will have certain regularities in structure and parameters, given by the MLN (e.g., all groundings of the same formula will have the same weight).
Example 4. In the case of the KB displayed in Table 2.1, using only the two last formulas and with constants Alice and Bob (A and B for short), we can see the process of constructing ML,C in Figure 5.1.

(a): Ground predicates. (b): If two people are friends, either both smoke or neither does. (c): Smoking causes cancer. (d): Cliques that have associated weights.

Figure 5.1: Ground Markov network construction for the last two formulas in Table 2.1, with constants Alice (A) and Bob (B).

Starting from the formulas ¬Sm(x) ∨ Ca(x) and (¬Fr(x, y) ∨ Sm(x) ∨ ¬Sm(y)) ∧ (¬Fr(x, y) ∨ ¬Sm(x) ∨ Sm(y)), the process can be taken apart as follows:

1. Identify ground predicates and add them as nodes. In this case all the groundings of the Friends predicate are Fr(A, A), Fr(B, B), Fr(A, B), Fr(B, A).
2. Add an edge for each pair of grounded predicates in ¬Sm(x) ∨ Ca(x).
3. Add an edge for each pair of grounded predicates in (¬Fr(x, y) ∨ Sm(x) ∨ ¬Sm(y)) ∧ (¬Fr(x, y) ∨ ¬Sm(x) ∨ Sm(y)).
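The three construction steps are easy to mechanize. The listing below enumerates the ground atoms and the (not necessarily maximal) cliques for the constants A and B, following Definition 10; the helper names are mine:

```python
from itertools import product

constants = ["A", "B"]

# Step 1: one binary node per grounding of each predicate.
nodes  = [("Sm", x) for x in constants]
nodes += [("Ca", x) for x in constants]
nodes += [("Fr", x, y) for x, y in product(constants, repeat=2)]

# Step 2: one ground formula (feature/clique) per grounding of ¬Sm(x) ∨ Ca(x).
cancer_cliques = [{("Sm", x), ("Ca", x)} for x in constants]

# Step 3: one per grounding of the friends formula, over ordered pairs (x, y).
friends_cliques = [{("Fr", x, y), ("Sm", x), ("Sm", y)}
                   for x, y in product(constants, repeat=2)]

print(len(nodes), len(cancer_cliques), len(friends_cliques))   # 8 2 4
```

Repeating the run with constants = ["A", "B", "C"] reproduces the larger network of Figure 5.2: the template stays the same, only the groundings multiply.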
In Figure 5.2 we build ML,C another time for the same two formulas as in Figure 5.1, but this time using the additional constant Carol (C for short).

(a): Ground predicates. (b): If two people are friends, either both smoke or neither does. (c): Smoking causes cancer.

Figure 5.2: Another ground Markov network construction for the last two formulas in Table 2.1, this time with constants Alice (A), Bob (B) and Carol (C).
Once built, the ML,C can be used to infer the probability that Alice smokes given that she's friends with Bob, that Carol has cancer given that Alice smokes, etc. We do this by first grounding the KB and constructing its (Herbrand) interpretation². What we obtain by this process is what we call a possible world.

² In the sense of Subsection 2.3.
6. Probability distribution

From Definition 10, Equation 3.1 and Equation 3.2, the probability distribution over a possible world x specified by the ground Markov network ML,C is given by

P(X = x) = (1/Z) exp( Σi wi ni(x) )    (6.1)

where ni(x) is the number of true groundings of Fi in x and wi = ln(πi).
Now that we know how to calculate the probability distribution based on each formula's weights, we can make this assertion: a world W where n people with no friends don't smoke is e^(2.3n) times less likely than a world where all friendless people smoke:

P(W | N = n) / P(W | N = 0) = exp(2.3(|W| − n)) / exp(2.3|W|) = exp(−2.3n) = 1 / exp(2.3n)
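This ratio can be verified numerically with Equation 6.1: only the count of true groundings of the friendless-people-smoke formula differs between the two worlds, so every other term of the exponent cancels. The count of ten friendless people below is an arbitrary choice for illustration:

```python
from math import exp, isclose

w = 2.3          # weight of "friendless people smoke" (Table 2.1)

def unnorm(n_true):
    """The only non-cancelling term of Equation 6.1: exp(w * n_i(x))."""
    return exp(w * n_true)

total = 10       # assumed number of friendless people
for n in range(total + 1):
    # A world with n friendless non-smokers has total - n true groundings.
    ratio = unnorm(total - n) / unnorm(total)
    assert isclose(ratio, exp(-w * n))   # e^(2.3n) times less likely
```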
Notice that, if we were to use first-order logic, any world where even a single friendless person is a non-smoker would be impossible.

For the rest of this work we will make the three following assumptions. As [17, p. 112] explains, these assumptions:

[...] ensure that the set of possible worlds for (L, C) is finite, and that ML,C represents a unique, well-defined probability distribution over those worlds, irrespective of the interpretation and domain. These assumptions are quite reasonable in most practical applications, and greatly simplify the use of MLNs.
Assumption 1. Unique names (taken from [17, p. 112]). Different constants refer to different objects ([10]).

Assumption 2. Domain closure (taken from [17, p. 112]). The only objects in the domain are those representable using the constant and function symbols in (L, C) ([10]).

Assumption 3. Known functions (taken from [17, p. 112]). For each function appearing in L, the value of that function applied to every possible tuple of arguments is known, and is an element of C.
Using this last assumption, we can replace functions by their values when grounding formulas. When assigning weights to the formulas, if a formula contains more than one clause, its weight is divided equally among the clauses, and a clause's weight is assigned to each of its groundings. For an algorithm that constructs all groundings please read [17, p. 113]. If we think a little about it, we can see that Markov Logic Networks can encode any Markov Random Field (or Bayesian Network) and any first-order logic KB (by setting all weights to ∞). This is shown in Proposition 4.2 and Proposition 4.3 of [17].
6.1. Inference and Learning

Since each ground Markov network ML,C is a Markov random field, we can make inference in it as we explained in Section 4. In this context inference answers queries of the form P(F1 | F2, L, C), that is: what is the probability of a formula F1, given that we know the truth value of formula F2, using the MLN L and the set of constants C?

One common step before making queries over an MLN is learning - the process of adapting either the structure and/or the weights of an MLN to reflect some set of data, referred to as training data, with the ultimate goal of being able to make accurate inference. Although we will not go deeper into either topic, we will offer some tools for and examples of their usage in Section 7 and Section 8. For details and formalism please refer to [17, pp. 116-120].
7. Tools

Note. Based on [7] and [19]. Alchemy is a software tool that lets us define Markov Logic Networks and perform learning and inference tasks on them. It provides three shell programs, learnstruct, learnwts and infer, that let us perform structure learning, weight learning and inference, respectively. Each command's arguments can be seen in Table 7.1 and Table 7.2.

-i | Input .mln files
-o | Output .mln file
-t | Training .db files

Table 7.1: learnstruct and learnwts arguments.

Ground atoms are defined in .db (database) files. Ground atoms preceded by ! are false, by ? are unknown, and by neither are true. A .db file consists of a set of ground atoms, one per line.
-i | Input .mln files
-r | Output file containing inference results
-e | Evidence .db files
-q | Query atoms (comma-separated with no space)

Table 7.2: infer arguments.
A .mln file has a declaration part and a formula part. The declaration section must contain at least one predicate, while the formulas section contains zero or more formulas. You can express an arbitrary first-order formula in an .mln file. The syntax for logical connectors is shown in Table 7.3.

Symbol | Meaning | Precedence
! | Not. | 1
^ | And. | 2
v | Or. | 3
=> | Implies. | 4
<=> | If and only if. | 5
FORALL, forall, Forall | Universal quantification. | 6
EXIST, exist, Exist | Existential quantification. | 6
. | Hard formula. | NA
//, /* */ | Comment. | NA
@, $ | Reserved character. | NA
funcVar, isReturnValueOf | Reserved words. Neither variables nor predicates should start with them. | NA

Table 7.3: Logical operators and special characters/words. A lower number in the precedence column indicates a higher precedence.
Free variables are interpreted as universally quantified at the outermost level. Quantifiers can be applied to more than one variable at once (e.g., forall x,y). A formula in an .mln file can be preceded by a number representing the weight of the formula. A formula can also be terminated by a period (.), indicating that it is a hard formula³. However, a formula cannot have both a weight and a period. A legal identifier is a sequence of alphanumeric characters plus the characters - (hyphen), _ (underscore), and ' (prime); ' cannot be the first character. Variables in formulas must begin with a lowercase letter, and constants must begin with an uppercase one. Constants may also be expressed as strings (e.g., Alice and "A Course in Logic" are both acceptable as constants). Alchemy replaces existentially quantified subformulas by disjunctions of all their groundings; therefore, existential quantifiers (or negated universal quantifiers) should be used with care. Types and constants can be declared in an .mln file with the following syntax: <typename> = {<constant1>, <constant2>, ...}, e.g., person = {Alice, Bob}. One can also declare integer types, e.g., ageOfStudent = {18, ..., 22}. Each declared type must contain at least one constant.

³ That is, a formula whose weight is ∞.

We can, for example, write the KB of Table 2.1 in an example.mln file:

0.7 (!Friends(x,y) v !Friends(y,z) v Friends(x,z))
2.3 ((EXIST y Friends(x,y)) v Smokes(x))
1.5 (!Smokes(x) v Cancer(x))
1.1 (!Friends(x,y) v Smokes(x) v !Smokes(y)) ^ (!Friends(x,y) v !Smokes(x) v Smokes(y))

corresponding to Equation 7.1-Equation 7.4:
¬F r(x, y) ∨ ¬F r(y, z) ∨ F r(x, z)
(7.1)
∀x ¬[∃y F r(x, y)] =⇒ Sm(x)
(7.2)
¬Sm(x) ∨ Ca(x)
(7.3)
[¬F r(x, y) ∨ Sm(x) ∨ ¬Sm(y)] ∧[¬F r(x, y) ∨ ¬Sm(x) ∨ Sm(y)] and then write .db le:
Friends(Alice, Bob) Friends(Alice, Carol) Smokes(Carol) !Cancer(Bob) We can then make inference with this les by executing:
infer -i example.mln -e evidence.db -r result.db -q Cancer,Smokes and as a result we would obtain:
22
(7.4)
Cancer(Alice)  0.80397
Cancer(Carol)  0.79797
Smokes(Alice)  0.99895
Smokes(Bob)    0.954955

These are the probabilities that Alice and Carol have cancer, and that Alice and Bob are smokers.
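The result file pairs each query atom with its estimated probability, one atom per line, as shown above. A minimal Python sketch (not part of Alchemy; the only format assumption is the two-column layout above) for reading such output and filtering by probability:

```python
def read_results(lines, threshold=0.5):
    """Parse lines like 'Cancer(Alice) 0.80397' into a dict,
    keeping only atoms whose probability meets the threshold."""
    probs = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            atom, p = parts
            probs[atom] = float(p)
    return {atom: p for atom, p in probs.items() if p >= threshold}

results = read_results([
    "Cancer(Alice) 0.80397",
    "Cancer(Carol) 0.79797",
    "Smokes(Alice) 0.99895",
    "Smokes(Bob) 0.954955",
], threshold=0.9)
print(results)  # only the two Smokes atoms exceed 0.9
```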
8. Applications

The applications of MLNs range from Natural Language Processing (NLP, [4, 8, 21]) to Computer Vision ([1, 6, 9, 12, 22]), Network Analysis ([5]), and Bioinformatics ([3]). In the following section we will concern ourselves with one particular application in the field of natural language processing.
8.1. On NLP - Statistical Parsing
Note. Based on [7, pp. 13-15].

Statistical parsing is a family of parsing methods that extends rule-based parsing by adding a weight to each rule. These weights are normally learned from a corpus of text. In this subsection we would like to show how to perform statistical parsing over a small subset of English using MLNs. We will define the parsing rules as a context-free grammar in Chomsky normal form, in which we are restricted to rules of the form:

A → B C
A → a

where a is a terminal symbol and A, B, and C are non-terminals.
We now define a very simple grammar for parsing (also simple) sentences in English:

S  → NP VP
NP → Adj N
NP → Det N
VP → V NP

This grammar is composed of sentences (S), noun and verb phrases (NP and VP), adjectives (Adj), nouns (N), determiners (Det), and verbs (V). We can now write these rules in FOL as follows:
NP(i,j) ^ VP(j,k) => S(i,k)   // S -> NP VP
Adj(i,j) ^ N(j,k) => NP(i,k)  // NP -> Adj N
Det(i,j) ^ N(j,k) => NP(i,k)  // NP -> Det N
V(i,j) ^ NP(j,k) => VP(i,k)   // VP -> V NP
NP(i,k) ^ Det(i,j) => !Adj(i,j)

where i and j are indices between words, plus one at the beginning and one at the end of a sentence; e.g., the words in the sentence "Mary had a little lamb" would be indexed as 0 Mary 1 had 2 a 3 little 4 lamb 5. The last rule serves the purpose of solving the conflict between the second and third rules. Informally, it says that if a noun phrase results in a determiner and a noun, it cannot result in an adjective and a noun. We also need a collection of words, and rules indicating what those words are:
// Determiners
Token("a",i) => Det(i,j)
Token("the",i) => Det(i,j)
// Adjectives
Token("big",i) => Adj(i,j)
Token("small",i) => Adj(i,j)
// Nouns
Token("dogs",i) => N(i,j)
Token("dog",i) => N(i,j)
Token("cats",i) => N(i,j)
Token("cat",i) => N(i,j)
Token("fly",i) => N(i,j)
Token("flies",i) => N(i,j)
// Verbs
Token("chase",i) => V(i,j)
Token("chases",i) => V(i,j)
Token("eat",i) => V(i,j)
Token("eats",i) => V(i,j)
Token("fly",i) => V(i,j)
Token("flies",i) => V(i,j)

The only remaining problem is that some words can fall into more than one category (in our case only fly and flies). We must then make sure that such words get assigned only one category. One way to do this is to write mutual exclusion rules for each part-of-speech pair:
!Det(i,j) v !Adj(i,j)  // a word can be only one of {det., adj.}
!Det(i,j) v !N(i,j)    // a word can be only one of {det., noun}
!Det(i,j) v !V(i,j)    // a word can be only one of {det., verb}
!Adj(i,j) v !N(i,j)    // a word can be only one of {adj., noun}
!Adj(i,j) v !V(i,j)    // a word can be only one of {adj., verb}
!N(i,j) v !V(i,j)      // a word can be only one of {noun, verb}
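The six clauses above are exactly the 2-combinations of the four part-of-speech categories, so for larger tag sets they can be generated rather than written by hand. A small Python sketch (illustrative only; the category names are ours, and the output is plain text, not fed to Alchemy directly):

```python
from itertools import combinations

def mutual_exclusion_rules(categories):
    """Emit one Alchemy-style exclusion clause per unordered pair of categories."""
    return ["!%s(i,j) v !%s(i,j)" % (a, b)
            for a, b in combinations(categories, 2)]

# 4 categories yield C(4,2) = 6 clauses, matching the listing above
for rule in mutual_exclusion_rules(["Det", "Adj", "N", "V"]):
    print(rule)
```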
We can now use Alchemy to train our MLN on some database, and use the result to perform inference. For instance, if we were to input the sentence
// Sentence: Small dogs chase big flies
Token("small",0)
Token("dogs",1)
Token("chase",2)
Token("big",3)
Token("flies",4)
S(0,5)

we would expect the inference engine to output something like
Adj(0,1)  // small
N(1,2)    // dogs
V(2,3)    // chase
Adj(3,4)  // big
N(4,5)    // flies
NP(0,2)   // small dogs
NP(3,5)   // big flies
VP(2,5)   // chase big flies
For the sake of brevity we have skipped some details of the rules' definitions; in consequence this is not a working example. One can, however, get this and other more complete working examples at [15].
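For comparison, the span predicates used above correspond one-to-one to the chart entries of a CYK-style recognizer. The following plain-Python sketch (independent of Alchemy; grammar and lexicon transcribed from the rules above) derives the same facts deterministically:

```python
# Lexicon and CNF rules transcribed from the Token(...) and grammar rules above.
LEXICON = {
    "a": {"Det"}, "the": {"Det"},
    "big": {"Adj"}, "small": {"Adj"},
    "dog": {"N"}, "dogs": {"N"}, "cat": {"N"}, "cats": {"N"},
    "fly": {"N", "V"}, "flies": {"N", "V"},
    "chase": {"V"}, "chases": {"V"}, "eat": {"V"}, "eats": {"V"},
}
RULES = [("S", "NP", "VP"), ("NP", "Adj", "N"),
         ("NP", "Det", "N"), ("VP", "V", "NP")]

def parse_spans(words):
    """Return the set of (Category, i, j) span facts derivable for the sentence."""
    chart = set()
    for i, w in enumerate(words):                 # terminal spans (i, i+1)
        for cat in LEXICON.get(w, ()):
            chart.add((cat, i, i + 1))
    for length in range(2, len(words) + 1):       # wider spans, bottom-up
        for i in range(len(words) - length + 1):
            k = i + length
            for j in range(i + 1, k):
                for head, left, right in RULES:
                    if (left, i, j) in chart and (right, j, k) in chart:
                        chart.add((head, i, k))
    return chart

spans = parse_spans("small dogs chase big flies".split())
print(("S", 0, 5) in spans)  # the sentence parses: prints True
```

Note that for an ambiguous word such as flies the chart simply keeps both N(4,5) and V(4,5); the MLN's weighted mutual exclusion rules are what turn this hard ambiguity into a probabilistic choice.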
9. Pros and Cons

Experiments performed in [17, Sections 7, 8] show the promise of MLNs when compared to purely logical or purely probabilistic approaches; this should not be cause for surprise, since MLNs are able to generalize both. MLNs are also quite easy to work with from the user's perspective, given that they introduce no new concepts, but rather use old and seemingly simple concepts in a novel fashion. Furthermore, the wide range of research fields that use them is proof of their flexibility and relevance in a great number of very different tasks.

Nevertheless, one of, if not the major drawback of this method is its time performance. Witness to this is the abundance of work on speeding up inference ([18, 16, 2]) and learning ([13, 14]). The inference problem is rooted in how the ground Markov network is created, as it can grow very quickly even for small domains, while the learning problem is caused by the difficulties of working with first-order logic.
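The grounding blow-up can be made concrete with a back-of-the-envelope count (a sketch under the usual assumption that every variable ranges over one shared domain of constants): a clause with v distinct variables has |C|^v groundings over a domain of |C| constants.

```python
def num_groundings(num_constants, num_vars):
    """Number of ground clauses for a formula with `num_vars` distinct
    variables over a single domain of `num_constants` constants."""
    return num_constants ** num_vars

# The transitivity clause !Fr(x,y) v !Fr(y,z) v Fr(x,z) has 3 variables,
# so the ground network grows cubically in the number of constants:
for n in (10, 100, 1000):
    print(n, num_groundings(n, 3))  # 1000 constants yield 10**9 ground clauses
```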
Part IV.
Final Words

10. Conclusion

Markov logic networks are a hybrid approach that combines first-order logic with Markov random fields. They do so by simply attaching a weight to each first-order formula and then using the weighted first-order knowledge base as a template for constructing Markov random fields. An easy way to work with them is by using the Alchemy package; with its help we can define a knowledge base and then perform weight and structure learning, as well as inference over the resulting model.

Markov logic networks seem to be a very active research topic. A search for the term in Google Scholar (http://scholar.google.com) results in a couple thousand links on a wide range of applications of, and extensions to, the original scheme. Although they require some background in very different and traditionally separated areas (logic and probability theory), once grasped, their concepts are very intuitive to apply. Furthermore, given the widespread use of their underlying methods, and their capacity to generalize them, they provide an easy way of enhancing said methods' applications.
References

[1] Baxi, S. S., Dabhade, S., and Varma, P. D. A survey paper on methods of detecting human under partial occlusion. International Journal of Engineering 2, 4 (2013).

[2] Beedkar, K., Del Corro, L., and Gemulla, R. Fully parallel inference in Markov logic networks. In BTW (2013), pp. 205-224.

[3] Biba, M., Ferilli, S., and Esposito, F. Protein fold recognition using Markov logic networks. In Mathematical Approaches to Polymer Sequence Analysis and Related Problems, R. Bruni, Ed. Springer New York, 2011, pp. 69-85.

[4] Chan, K., and Lam, W. Pronoun resolution with Markov logic networks. In Information Retrieval Technology. Springer, 2008, pp. 153-164.

[5] Chen, H., Ku, W.-S., Wang, H., Tang, L., and Sun, M.-T. Linkprobe: Probabilistic inference on large-scale social networks.

[6] Choi, C., Choi, J., Lee, E., You, I., and Kim, P. Probabilistic spatio-temporal inference for motion event understanding. Neurocomputing 122 (2013), 24-32.

[7] Domingos, M. S. P. The Alchemy Tutorial, 2010.

[8] Fernández, D., Marinai, S., Lladós, J., and Fornés, A. Contextual word spotting in historical manuscripts using Markov logic networks. In Proceedings of the 2nd International Workshop on Historical Document Imaging and Processing (2013), ACM, pp. 36-43.

[9] Geier, T., Biundo, S., Reuter, S., and Dietmayer, K. Track-person association using a first-order probabilistic model. In Tools with Artificial Intelligence (ICTAI), 2012 IEEE 24th International Conference on (2012), vol. 1, IEEE, pp. 844-851.

[10] Genesereth, M. R., and Nilsson, N. J. Logical Foundations of Artificial Intelligence, vol. 9. Morgan Kaufmann, Los Altos, 1987.

[11] Koller, D., Friedman, N., Getoor, L., and Taskar, B. Graphical models in a nutshell. In Statistical Relational Learning (2007).

[12] Liu, Z., and von Wichert, G. A generalizable knowledge framework for semantic indoor mapping based on Markov logic networks and data driven MCMC. Future Generation Computer Systems (2013).

[13] Lowd, D., and Domingos, P. Efficient weight learning for Markov logic networks. In Knowledge Discovery in Databases: PKDD 2007. Springer, 2007, pp. 200-211.

[14] Mihalkova, L., and Mooney, R. J. Bottom-up learning of Markov logic network structure. In Proceedings of the 24th International Conference on Machine Learning (2007), ACM, pp. 625-632.

[15] Alchemy: Open source AI. http://alchemy.cs.washington.edu/.

[16] Niu, F., Ré, C., Doan, A., and Shavlik, J. Tuffy: Scaling up statistical inference in Markov logic networks using an RDBMS. Proceedings of the VLDB Endowment 4, 6 (2011), 373-384.

[17] Richardson, M., and Domingos, P. Markov logic networks. Machine Learning 62, 1-2 (2006), 107-136.

[18] Shavlik, J. W., and Natarajan, S. Speeding up inference in Markov logic networks by preprocessing to reduce the size of the resulting grounded network. In IJCAI (2009), vol. 9, pp. 1951-1956.

[19] Kok, S., Singla, P., et al. The Alchemy system for statistical relational AI: User manual, January 2010.

[20] Russell, S., and Norvig, P. Artificial Intelligence: A Modern Approach, 3rd ed. Prentice Hall, 2009.

[21] Wu, Z., Yu, Z., Guo, J., Mao, C., and Zhang, Y. Fusion of long distance dependency features for Chinese named entity recognition based on Markov logic networks. In Natural Language Processing and Chinese Computing, M. Zhou, G. Zhou, D. Zhao, Q. Liu, and L. Zou, Eds., vol. 333 of Communications in Computer and Information Science. Springer Berlin Heidelberg, 2012, pp. 132-142.

[22] Xu, M., and Petrou, M. Learning logic rules for scene interpretation based on Markov logic networks. In Computer Vision - ACCV 2009, H. Zha, R.-i. Taniguchi, and S. Maybank, Eds., vol. 5996 of Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2010, pp. 341-350.