Tutorials

Monday, September 7, 2009:
- Learning from Multi-label Data, by G. Tsoumakas (Aristotle University of Thessaloniki), M.-L. Zhang (Hohai University), and Z.-H. Zhou (Nanjing University)
- Language and Document Analysis: Motivating Latent Variable Models, by W. Buntine (Helsinki Institute of IT, NICTA)
- Methods for Large Network Analysis, by V. Batagelj (University of Ljubljana)

Friday, September 11, 2009:
- Evaluation in Machine Learning, by P. Cunningham (University College Dublin)
- Transfer Learning for Reinforcement Learning Domains, by A. Lazaric (INRIA Lille) and M. Taylor (University of Southern California)
- Graphical Models, by T. Caetano (NICTA)

Tutorial Chair: C. Archambeau (University College London)

Learning from Multi-Label Data
Tutorial at ECML/PKDD’09, Bled, Slovenia, 7 September 2009

Grigorios Tsoumakas, Department of Informatics, Aristotle University of Thessaloniki, Greece
Min-Ling Zhang, College of Computer and Information Engineering, Hohai University, China
Zhi-Hua Zhou, LAMDA Group, National Key Laboratory for Novel Software Technology, Nanjing University, China

Outline
- Introduction
- Overview of existing techniques
- Advanced topics
- The Mulan open-source software

Outline
- Introduction
  - What is multi-label learning
  - Applications and datasets
  - Multi-label evaluation metrics
- Overview of existing techniques
- Advanced topics
- The Mulan open-source software


The Larger Picture
- Data with multiple target variables: what can the type of targets be?
  - Numerical targets
    - Ecological modeling and environmental applications
    - Industrial applications (automobile)
  - Categorical targets
    - Binary targets: the multi-label case
    - Multi-class targets
  - Ordinal targets
  - Combinations of types

Notation for Multi-Label Data
- A d-dimensional input space X (numeric or nominal features)
- A set of q output labels L = {λ1, ..., λq}
- A multi-label dataset of m training examples (x_i, Y_i), i = 1..m, where x_i ∈ X and Y_i ⊆ L

Multi-Label Learning Tasks (1/3)
- Classification
  - Produce a bipartition of the set of labels into a relevant (positive) and an irrelevant (negative) set
  - For example, given an unobserved instance x, produce a bipartition (P_x, N_x) with P_x ∪ N_x = L and P_x ∩ N_x = ∅

Multi-Label Learning Tasks (2/3)
- Ranking
  - Produce a ranking (total strict order) of all labels according to relevance to the given instance
  - For example, given an unobserved instance x, produce a ranking r_x, where r_x(λ) denotes the position of label λ in the ranking

Multi-Label Learning Tasks (3/3)
- Classification and Ranking
  - Produce both a bipartition and a ranking of all labels
  - The two should be consistent: r_x(λ) < r_x(λ') for every λ ∈ P_x and λ' ∈ N_x
  - For example, given an unobserved instance x, produce a bipartition and a ranking

Outline
- Introduction
  - What is multi-label learning
  - Applications and datasets
  - Multi-label evaluation metrics
- Overview of existing techniques
- Advanced topics
- The Mulan open-source software

Applications and Datasets
- (Semi-)automated annotation of large object collections for information retrieval: text/web, image, video, audio, biology
- Tag suggestion in Web 2.0 systems
- Query categorization
- Drug discovery
- Direct marketing
- Medical diagnosis

Text (1/4)
- News
  - An article concerning the Antikythera Mechanism can be categorized to Science/Technology and History/Culture
- Reuters Collection Version 1 [Lewis et al., JMLR04]
  - 804,414 newswire stories indexed by Reuters Ltd
  - 103 topics organized in a hierarchy, 2.6 per story on average
  - 350 industries (2-level hierarchy post-produced)
  - 296 geographic codes

Text (2/4)
- Research articles
  - A research paper on an ensemble method for multi-label classification can be assigned to the areas Ensemble methods and Structured output prediction
- Collections
  - OHSUMED [Hersh et al., SIGIR94]: Medical Subject Headings (MeSH) ontology
  - ACM-DL [Veloso et al., ECMLPKDD07]: 81,251 Digital Library articles; ACM Computing Classification System (1st level: 11 labels, 2nd level: 81 labels)

Text (3/4)
- EUR-Lex collection [Loza Mencia & Furnkranz, ECMLPKDD08]
  - 19,596 legal documents of the European Union (EU)
  - Hierarchy of 3,993 EUROVOC labels, 5.4 on average (EUROVOC is a multilingual thesaurus for EU documents)
  - 201 subject matters, 2.2 on average
  - 412 directory codes, 1.3 on average
- WIPO-alpha collection [Fall et al., SIGIRForum03]
  - World Intellectual Property Organization (WIPO)
  - 75,000 patents
  - 4-level hierarchy of ~5,000 categories

Text (4/4)
- Aviation safety reports (tmc2007)
  - Competition of the SIAM Text Mining 2007 Workshop
  - 28,596 NASA aviation safety reports in free-text form
  - 22 problem types that appear during flights, 2.2 annotations on average
- Free clinical text in radiology reports (medical)
  - Computational Medicine Center's 2007 Medical NLP Challenge [Pestian et al., ACL07w]
  - 978 reports, 45 labels, 1.3 labels on average

Web
- Email: Enron dataset
  - UC Berkeley Enron Email Analysis Project
  - 1,702 examples, 53 labels, 3.4 on average
  - 2-level hierarchy
- Web pages: hierarchical classification schemes
  - Open Directory Project
  - Yahoo! Directory [Ueda & Saito, NIPS02]

Image and Video
- Application: automated annotation for retrieval
- Datasets
  - Scene [Boutell et al., PR04]: 2,407 images, 6 labels, 1.1 on average
  - Mediamill [Snoek et al., MM06]: 85 hours of video data containing Arabic, Chinese, and US broadcast news sources, recorded during November 2004; 43,907 frames, 101 labels, 4.4 on average


Audio (1/2)
- Music and metadata database of the HiFind company
  - 450,000 categorized tracks since 1999
  - 935 labels from 16 categories (340 genre labels)
  - Annotation categories: style, genre, musical setup, main instruments, variant, dynamics, tempo, era/epoch, metric, country, situation, mood, character, language, rhythm, popularity
  - 25 annotators (musicians, music journalists) plus a supervisor
  - Software-based annotation takes 8 minutes per track on average, with 37 annotations per track on average
- A subset was used in [Pachet & Roy, TASLP09]: 32,978 tracks, 632 labels, 98 acoustic features

Audio (2/2)
- Emotional categorization of music
  - Relevant works: [Li & Ogihara, ISMIR03; TMM06; Wieczorkowska et al., IIPWM06]
  - Dataset emotions in [Trohidis et al., ISMIR08]: 593 tracks, 6 labels {happy, calm, sad, angry, quiet, amazed}, 1.9 on average
  - Some applications: song selection in mobile devices, music therapy, music recommendation systems, TV and radio programs
- Acoustic data [Streich & Buhmann, ECMLPKDD08]
  - Construction of hearing aid instruments
  - Labels: Noise, Speech, Music

Biology (1/2)
- Application: automated annotation of proteins with functions
- Annotation hierarchies
  - The Functional Catalogue (FunCat): a tree-shaped hierarchy of annotations for the functional description of proteins from several living organisms
  - The Gene Ontology (GO): a directed acyclic graph of annotations for gene products

Biology (2/2)
- Datasets
  - Yeast [Elisseeff & Weston, NIPS02]: 2,417 examples, 14 labels (1st FunCat level), 4.2 on average
  - Phenotype (yeast) [Clare & King, ECMLPKDD01]: 1,461 examples, 4 FunCat levels
  - 12 yeast datasets [Clare, PhdThesis03; Vens et al., MLJ08]
    - Gene expression, homology, phenotype, secondary structure
    - FunCat: 6 levels, 492 labels, 8.8 on average
    - GO: 14 levels, 3,997 labels, 35.0 on average

Tag Suggestion in Web 2.0 Systems
- Benefits: richer descriptions of objects, folksonomy alignment
- Input: feature representation of objects (content)
- Challenges: huge number of tags, fast online predictions
- Related work: [Song et al., CIKM08; Katakis et al., ECMLPKDD08w]

Query Categorization
- Benefits
  - Integrate query-specific rich content from vertical search results (e.g. from a database)
  - Identify relevant sponsored ads: place ads on categories vs. keywords
- Example: Yahoo! [Tang et al., WWW09]
  - 6,433 categories organized in an 8-level taxonomy
  - 1.5 million manually labeled unique queries
  - Labels per query range from 1 to 26 (1 label: 81%, 2 labels: 16%, 3+ labels: 3%)

Drug Discovery
- MDL Drug Data Report v. 2001
  - 119,110 chemical structures of drugs
  - 701 biological activities (e.g. calcium channel blocker, neutral endopeptidase inhibitor, cardiotonic, diuretic)
- Example: hypertension [Kawai & Takahashi, CBIJ09]
  - Two major activities of hypertension drugs: angiotensin converting enzyme inhibitor and neutral endopeptidase inhibitor
  - Compounds producing both of these specific activities were found to be an effective new type of drug

Other
- Direct marketing [Zhang et al., JMLR06]
  - A direct marketing company sends offers to clients for products of categories they are potentially interested in
  - Historical data of clients and the product categories they showed interest in (multiple categories)
  - Data from the Direct Marketing Association: 19 categories
  - Classification and ranking: send only relevant products, or send the top X products
- Medical diagnosis
  - A patient may be suffering from multiple diseases at the same time, e.g. {obesity, hypertension}

Multi-Label Statistics
- Label cardinality, c: the average number of labels per example
- Label density: label cardinality divided by the total number of labels, c/q
- Distinct labelsets: the number of different label combinations
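To make these definitions concrete, here is a minimal Java sketch (not Mulan code; the boolean-matrix representation of label assignments is this sketch's own assumption) that computes all three statistics for the small example dataset used later in this tutorial:

// Minimal sketch: multi-label statistics over a dataset whose label
// assignments are an m x q boolean matrix (true = label present).
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class MultiLabelStats {
    public static void main(String[] args) {
        boolean[][] y = { // the 4-example, 4-label dataset used in this tutorial
            {true, false, false, true},   // {λ1, λ4}
            {false, false, true, true},   // {λ3, λ4}
            {true, false, false, false},  // {λ1}
            {false, true, true, true}     // {λ2, λ3, λ4}
        };
        int m = y.length, q = y[0].length, labelCount = 0;
        Set<String> distinct = new HashSet<>();
        for (boolean[] row : y) {
            for (boolean b : row) if (b) labelCount++;
            distinct.add(Arrays.toString(row)); // labelset as a canonical string
        }
        double cardinality = (double) labelCount / m;  // avg labels per example
        double density = cardinality / q;              // cardinality / q
        System.out.printf("cardinality=%.2f density=%.3f distinct=%d%n",
                cardinality, density, distinct.size()); // 2.00 0.500 4
    }
}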

Outline
- Introduction
  - What is multi-label learning
  - Applications and datasets
  - Multi-label evaluation metrics
- Overview of existing techniques
- Advanced topics
- The Mulan open-source software

Evaluation Metrics: A Taxonomy
- Based on calculation [Tsoumakas & Vlahavas, ECMLPKDD07]
  - Example-based: calculated separately for each test example and averaged across the test set
  - Label-based: calculated separately for each label and then averaged across all labels
- Based on the output of the learner
  - Binary prediction for each label
  - Ranking of the labels (example-based)
  - Probability or score for each label

Example-Based Binary (1/2)
- Notation
  - Y_i is the set of actual labels for instance x_i
  - P_i is the set of predicted labels for instance x_i
- Subset accuracy [Zhu et al., SIGIR05; Ghamrawi & McCallum, CIKM05]
  - (1/m) Σ_i I(P_i = Y_i), where I(true) = 1 and I(false) = 0
- Hamming loss [Schapire & Singer, MLJ00]
  - (1/m) Σ_i |P_i Δ Y_i| / q, where Δ is the symmetric difference (XOR) of the two sets
  - The average binary classification error
- Accuracy [Godbole & Sarawagi, PAKDD04]
  - (1/m) Σ_i |P_i ∩ Y_i| / |P_i ∪ Y_i|

Example-Based Binary (2/2)
- Information retrieval view [Godbole & Sarawagi, PAKDD04]
  - Precision: (1/m) Σ_i |P_i ∩ Y_i| / |P_i|
  - Recall: (1/m) Σ_i |P_i ∩ Y_i| / |Y_i|
  - F-measure: the harmonic mean of precision and recall
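A minimal Java sketch of these example-based binary measures, assuming label sets are represented as boolean arrays (the two test examples here are illustrative, not from the tutorial):

// Minimal sketch: example-based binary metrics over a small test set.
// Y[i] = true labels, P[i] = predicted labels, boolean arrays of length q.
public class ExampleBasedMetrics {
    public static void main(String[] args) {
        boolean[][] Y = {{true, false, false, true}, {false, false, true, true}};
        boolean[][] P = {{true, false, true, true}, {false, false, true, false}};
        int m = Y.length, q = Y[0].length;
        double subset = 0, hamming = 0, acc = 0, prec = 0, rec = 0, f1 = 0;
        for (int i = 0; i < m; i++) {
            int inter = 0, union = 0, xor = 0, sizeY = 0, sizeP = 0;
            for (int j = 0; j < q; j++) {
                if (Y[i][j] && P[i][j]) inter++;
                if (Y[i][j] || P[i][j]) union++;
                if (Y[i][j] != P[i][j]) xor++;
                if (Y[i][j]) sizeY++;
                if (P[i][j]) sizeP++;
            }
            subset += (xor == 0) ? 1 : 0;        // I(P_i == Y_i)
            hamming += (double) xor / q;          // symmetric difference / q
            acc += union == 0 ? 1 : (double) inter / union;
            double p = sizeP == 0 ? 0 : (double) inter / sizeP;
            double r = sizeY == 0 ? 0 : (double) inter / sizeY;
            prec += p; rec += r;
            f1 += (p + r == 0) ? 0 : 2 * p * r / (p + r);
        }
        System.out.printf("subsetAcc=%.2f hamLoss=%.2f acc=%.2f P=%.2f R=%.2f F1=%.2f%n",
                subset / m, hamming / m, acc / m, prec / m, rec / m, f1 / m);
    }
}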

Example-Based Ranking (1/2) [Schapire & Singer, MLJ00]
- One-error
  - Evaluates how many times the top-ranked label is not in the set of proper labels of the example
  - (1/m) Σ_i I(argmin_λ r_i(λ) ∉ Y_i)
- Coverage
  - Evaluates how many steps are needed, on average, to go down the label ranking to cover all proper labels of the example
  - (1/m) Σ_i max_{λ ∈ Y_i} r_i(λ) - 1

Example-Based Ranking (2/2) [Schapire & Singer, MLJ00]
- Ranking loss
  - Evaluates the average fraction of label pairs that are mis-ordered for the instance
  - (1/m) Σ_i |{(λ, λ') ∈ Y_i × Ȳ_i : r_i(λ) > r_i(λ')}| / (|Y_i| |Ȳ_i|)
- Average precision
  - Evaluates the average fraction of labels ranked above a proper label that are also proper labels
  - (1/m) Σ_i (1/|Y_i|) Σ_{λ ∈ Y_i} |{λ' ∈ Y_i : r_i(λ') ≤ r_i(λ)}| / r_i(λ)
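The ranking measures are easy to get wrong by one position, so here is a minimal Java sketch for a single illustrative example, assuming ranks start at 1 (top of the list):

// Minimal sketch: example-based ranking metrics for one example.
// rank[j] = position (1 = top) of label j; relevant[j] = ground truth.
public class RankingMetrics {
    public static void main(String[] args) {
        int[] rank = {2, 4, 1, 3};                       // λ3 ranked first
        boolean[] relevant = {true, false, false, true}; // Y = {λ1, λ4}
        int q = rank.length;
        // one-error: is the top-ranked label irrelevant?
        int top = -1;
        for (int j = 0; j < q; j++) if (rank[j] == 1) top = j;
        int oneError = relevant[top] ? 0 : 1;
        // coverage: worst rank of a relevant label, minus 1
        int cov = 0;
        for (int j = 0; j < q; j++) if (relevant[j]) cov = Math.max(cov, rank[j]);
        cov -= 1;
        int mis = 0, pairs = 0; // for ranking loss
        double ap = 0;          // for average precision
        int numRel = 0;
        for (int j = 0; j < q; j++) {
            if (!relevant[j]) continue;
            numRel++;
            int atOrAbove = 0;  // relevant labels ranked at or above λj
            for (int k = 0; k < q; k++) {
                if (!relevant[k]) { pairs++; if (rank[k] < rank[j]) mis++; }
                else if (rank[k] <= rank[j]) atOrAbove++;
            }
            ap += (double) atOrAbove / rank[j];
        }
        System.out.printf("oneError=%d coverage=%d rloss=%.2f avgPrec=%.2f%n",
                oneError, cov, (double) mis / pairs, ap / numRel);
    }
}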

Label-Based Binary
- Let B(TPj, FPj, TNj, FNj) be a binary evaluation measure calculated from the contingency table of label λj, e.g. accuracy = (TPj + TNj) / (TPj + FPj + TNj + FNj)

Contingency table for λj:
                      Actual POS | Actual NEG
Learner output POS  | TPj        | FPj
Learner output NEG  | FNj        | TNj

- Macro-averaging: ordinary averaging of a binary measure
  - B_macro = (1/q) Σ_j B(TPj, FPj, TNj, FNj)
- Micro-averaging: labels as different instances of the same global label
  - B_micro = B(Σ_j TPj, Σ_j FPj, Σ_j TNj, Σ_j FNj)
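A minimal Java sketch contrasting the two averaging schemes for the F-measure (the per-label counts are illustrative):

// Minimal sketch: macro- vs micro-averaged F1 from per-label counts.
public class MacroMicro {
    public static void main(String[] args) {
        // tp/fp/fn per label (tn is not needed for F1); illustrative numbers
        int[] tp = {50, 5}, fp = {10, 20}, fn = {10, 5};
        double macroF1 = 0;
        int TP = 0, FP = 0, FN = 0;
        for (int j = 0; j < tp.length; j++) {
            macroF1 += f1(tp[j], fp[j], fn[j]); // average of per-label F1
            TP += tp[j]; FP += fp[j]; FN += fn[j];
        }
        macroF1 /= tp.length;
        double microF1 = f1(TP, FP, FN);        // F1 of the pooled counts
        System.out.printf("macroF1=%.3f microF1=%.3f%n", macroF1, microF1);
    }
    static double f1(int tp, int fp, int fn) {
        return (2.0 * tp) / (2.0 * tp + fp + fn);
    }
}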

Probabilities or Scores per Label
- A bipartition can be obtained via thresholding: example/label-based binary measures
- A ranking can be obtained after solving ties: example-based ranking measures
- A vertical (per-label) ranking can be computed: ranking measures calculated vertically (label-based)
- Threshold-independent label-based evaluation
  - Only for probabilities!
  - Area under a ROC or PR curve

Which Measure Should I Use?
- Computer-aided annotation by humans (e.g. tag suggestion in Web 2.0 systems): example-based ranking measures
- Automated annotation for retrieval: macro-averaged F-measure
- Direct marketing: example-based precision and ranking
- Query categorization: example-based precision
- Ideas?

Outline
- Introduction
- Overview of existing techniques
  - Problem transformation methods
  - Algorithm adaptation methods
  - From ranking to classification
- Advanced topics
- The Mulan open-source software


A Categorization [Tsoumakas & Katakis, IJDWM07]
- Problem transformation methods
  - Transform the learning task into one or more single-label classification tasks
  - Algorithm independent
  - Some could be used for feature selection as well
- Algorithm adaptation methods
  - Extend specific learning algorithms to handle multi-label data directly
  - Boosting, generative (Bayesian), SVM, decision tree, neural network, lazy, ...

Problem Transformation Methods
- Binary relevance
- Ranking via single-label learning
- Pairwise methods: ranking by pairwise comparison, calibrated label ranking
- Methods that combine labels: Label Powerset, Pruned Sets
- Ensemble methods: RAkEL, EPS

Example Multi-Label Dataset

L = {λ1, λ2, λ3, λ4}

Ex # | Features | Label set
1    | x1       | {λ1, λ4}
2    | x2       | {λ3, λ4}
3    | x3       | {λ1}
4    | x4       | {λ2, λ3, λ4}

Binary Relevance (BR)
- How it works
  - Learns one binary classifier for each label
  - Outputs the union of their predictions
  - Can do ranking if the classifiers output scores
- Limitation: does not consider label relationships
- Complexity: O(qm)

The example dataset is transformed into one binary dataset per label:

Ex # | λ1    | λ2    | λ3    | λ4
1    | true  | false | false | true
2    | false | false | true  | true
3    | true  | false | false | false
4    | false | true  | true  | true
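A minimal Java sketch of the BR transformation itself (training the q classifiers is omitted; the boolean matrix encodes the label sets of the example dataset above):

// Minimal sketch: the BR transformation. From one multi-label dataset it
// derives q binary target vectors, one per label; each would then be paired
// with the shared feature vectors to train q independent binary classifiers.
public class BinaryRelevanceTransform {
    public static void main(String[] args) {
        boolean[][] y = {
            {true, false, false, true},
            {false, false, true, true},
            {true, false, false, false},
            {false, true, true, true}
        };
        int m = y.length, q = y[0].length;
        for (int j = 0; j < q; j++) {
            StringBuilder sb = new StringBuilder("dataset for λ" + (j + 1) + ":");
            for (int i = 0; i < m; i++)
                sb.append(" ex").append(i + 1).append('=').append(y[i][j]);
            System.out.println(sb); // e.g. "dataset for λ1: ex1=true ex2=false ..."
        }
    }
}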

Ranking via Single-Label Learning
- Basic concept
  - Transform the multi-label dataset to a single-label multi-class dataset with the labels as classes
  - A single-label classifier that outputs a score (e.g. probability) for each class can produce a ranking
- Transformations [Boutell et al., PR04; Chen et al., ICDM07]: ignore; select-max, select-min, select-random; copy, copy-weight (entropy)

Ignore
- Simply ignore all multi-label examples: major information loss!

Ex # | Label set            Ex # | Label
1    | {λ1, λ4}             3    | λ1
2    | {λ3, λ4}
3    | {λ1}
4    | {λ2, λ3, λ4}

Select Min, Max and Random
- Select one of the labels: the most frequent (Max), the least frequent (Min), or a random one: information loss!

Ex # | Label set      | Max | Min | Random
1    | {λ1, λ4}       | λ4  | λ1  | λ4
2    | {λ3, λ4}       | λ4  | λ3  | λ3
3    | {λ1}           | λ1  | λ1  | λ1
4    | {λ2, λ3, λ4}   | λ4  | λ2  | λ2

Copy and Copy-Weight (Entropy)
- Replace each example (x_i, Y_i) with |Y_i| examples (x_i, λ_j), one for each λ_j ∈ Y_i
- Copy-weight requires learners that take example weights into account: it weights each copy by 1/|Y_i|
- No information loss, but the number of examples increases to O(mc)

Ex # | Label | Weight
1a   | λ1    | 0.50
1b   | λ4    | 0.50
2a   | λ3    | 0.50
2b   | λ4    | 0.50
3    | λ1    | 1.00
4a   | λ2    | 0.33
4b   | λ3    | 0.33
4c   | λ4    | 0.33

Ranking by Pairwise Comparison (1/4)
- How it works [Hullermeier et al., AIJ08]
  - Learns q(q-1)/2 binary models, one for each pair of labels (λi, λj), 1 ≤ i < j ≤ q
  - Each model is trained on the examples that are annotated with at least one of the two labels, but not both, and learns to separate the corresponding labels
  - Given a new instance, all models are invoked and a ranking is obtained by counting the votes received by each label

Ranking by Pairwise Comparison (2/4)

Original dataset (Ex #: label set): 1: {λ1, λ4}, 2: {λ3, λ4}, 3: {λ1}, 4: {λ2, λ3, λ4}

Pairwise training sets (true = the first label of the pair is the relevant one):
1vs2: ex1 true, ex3 true, ex4 false
1vs3: ex1 true, ex2 false, ex3 true, ex4 false
1vs4: ex2 false, ex3 true, ex4 false
2vs3: ex2 false
2vs4: ex1 false, ex2 false
3vs4: ex1 false

Ranking by Pairwise Comparison (3/4)

For a new instance x', the six models predict: 1vs2 → λ1, 1vs3 → λ3, 1vs4 → λ1, 2vs3 → λ3, 2vs4 → λ2, 3vs4 → λ3

Label | Votes
λ1    | 2
λ2    | 1
λ3    | 3
λ4    | 0

Ranking: λ3 > λ1 > λ2 > λ4
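The vote-counting step can be sketched in a few lines of Java (the pairwise predictions below are hard-coded from the worked example; a real implementation would obtain them from the q(q-1)/2 trained models):

// Minimal sketch: RPC vote counting. winner[a][b] holds the (0-based) label
// index, a or b, predicted by the model for the pair (λa, λb), a < b.
public class PairwiseVoting {
    public static void main(String[] args) {
        int q = 4;
        // 1vs2->λ1, 1vs3->λ3, 1vs4->λ1, 2vs3->λ3, 2vs4->λ2, 3vs4->λ3
        int[][] winner = {
            {-1, 0, 2, 0},
            {-1, -1, 2, 1},
            {-1, -1, -1, 2}
        };
        int[] votes = new int[q];
        for (int a = 0; a < q - 1; a++)
            for (int b = a + 1; b < q; b++)
                votes[winner[a][b]]++;
        for (int j = 0; j < q; j++)
            System.out.println("λ" + (j + 1) + ": " + votes[j] + " votes");
        // prints λ1: 2, λ2: 1, λ3: 3, λ4: 0 -> ranking λ3 > λ1 > λ2 > λ4
    }
}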

Ranking by Pairwise Comparison (4/4)
- Time complexity
  - Training: O(mqc), where c is the label cardinality; each example x appears in |Px|(q-|Px|) < |Px|q datasets
  - Testing: needs to query O(q^2) binary models
- Space complexity
  - Needs to maintain O(q^2) binary models in memory
  - Pairwise decision tree/rule learning models might be simpler than one-vs-rest ones
  - Perceptrons/SVMs store a constant number of parameters per model

Calibrated Label Ranking (1/4)
- How it works [Furnkranz et al., MLJ08]
  - Extends ranking by pairwise comparison by introducing an additional virtual label λV, with the purpose of separating positive from negative labels
  - Pairwise models that include the virtual label correspond to the models of binary relevance
    - All examples are used
    - When a label is true, the virtual label is considered false
    - When a label is false, the virtual label is considered true
  - The final ranking includes the virtual label, which acts as the split point between positive and negative labels

Calibrated Label Ranking (2/4)

Original dataset (Ex #: label set): 1: {λ1, λ4}, 2: {λ3, λ4}, 3: {λ1}, 4: {λ2, λ3, λ4}

Models against the virtual label (these are the BR models):
1vsV: ex1 true, ex2 false, ex3 true, ex4 false
2vsV: ex1 false, ex2 false, ex3 false, ex4 true
3vsV: ex1 false, ex2 true, ex3 false, ex4 true
4vsV: ex1 true, ex2 true, ex3 false, ex4 true

Pairwise models (as in RPC):
1vs2: ex1 true, ex3 true, ex4 false
1vs3: ex1 true, ex2 false, ex3 true, ex4 false
1vs4: ex2 false, ex3 true, ex4 false
2vs3: ex2 false
2vs4: ex1 false, ex2 false
3vs4: ex1 false

Calibrated Label Ranking (3/4)

For a new instance x', the pairwise models predict 1vs2 → λ1, 1vs3 → λ1, 1vs4 → λ1, 2vs3 → λ2, 2vs4 → λ2, 3vs4 → λ4, and the virtual-label models predict 1vsV → λ1, 2vsV → λV, 3vsV → λV, 4vsV → λV

Label | Votes
λ1    | 4
λ2    | 2
λ3    | 0
λ4    | 1
λV    | 3

Ranking: λ1 > λV > λ2 > λ4 > λ3, so the bipartition is {λ1} vs {λ2, λ3, λ4}

Calibrated Label Ranking (4/4)
- Benefits
  - Improved ranking performance
  - Classification and ranking (consistent)
- Limitations
  - Space complexity (as in RPC); a solution for perceptrons [Loza Mencia & Furnkranz, ECMLPKDD08]
  - Querying q^2 + q models at runtime; the QWeighted algorithm [Loza Mencia et al., ESANN09]

Label Powerset (LP) (1/5)
- How it works
  - Each different set of labels in a multi-label training set becomes a different class in a new single-label classification task
  - Given a new instance, the single-label classifier of LP outputs the most probable class (a set of labels)

Ex # | Label set      | Class
1    | {λ1, λ4}       | 1001
2    | {λ3, λ4}       | 0011
3    | {λ1}           | 1000
4    | {λ2, λ3, λ4}   | 0111
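A minimal Java sketch of the LP transformation, mapping each distinct labelset to a class id (the 0/1 string encoding mirrors the table above):

// Minimal sketch: the LP transformation. Each distinct labelset, encoded as
// a 0/1 string such as "1001", becomes a class of a multi-class problem.
import java.util.LinkedHashMap;
import java.util.Map;

public class LabelPowersetTransform {
    public static void main(String[] args) {
        boolean[][] y = {
            {true, false, false, true},
            {false, false, true, true},
            {true, false, false, false},
            {false, true, true, true}
        };
        Map<String, Integer> classOf = new LinkedHashMap<>();
        for (int i = 0; i < y.length; i++) {
            StringBuilder code = new StringBuilder();
            for (boolean b : y[i]) code.append(b ? '1' : '0');
            Integer c = classOf.get(code.toString());
            if (c == null) {                  // first time this labelset is seen
                c = classOf.size();
                classOf.put(code.toString(), c);
            }
            System.out.println("ex" + (i + 1) + " -> class " + c + " (" + code + ")");
        }
        System.out.println(classOf.size() + " distinct classes");
    }
}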

Label Powerset (LP) (2/5)
- Ranking
  - Possible if a classifier that outputs scores (e.g. probabilities) is used [Read, NZCSRS08]: the score of each label is the sum of the scores of the classes that contain it
  - Are the bipartition and ranking always consistent? Consider two class distributions, P1 and P2:

c    | P1(c|x) | P2(c|x) | λ1 λ2 λ3 λ4
1001 | 0.7     | 0.1     | 1  0  0  1
0011 | 0.2     | 0.3     | 0  0  1  1
1000 | 0.1     | 0.4     | 1  0  0  0
0111 | 0.0     | 0.2     | 0  1  1  1

Σ P(c|x)·λj, per label: under P1: 0.8, 0.0, 0.2, 0.9; under P2: 0.5, 0.2, 0.5, 0.6

- Under P2 the most probable class is 1000, giving the bipartition {λ1}, yet the per-label sums rank λ4 highest, so bipartition and ranking are not always consistent

Label Powerset (LP) (3/5)
- Complexity
  - Depends on the number of distinct labelsets that exist in the training set
  - Upper bounded by min(m, 2^q); usually much smaller, but still larger than q
- Limitations
  - High complexity
  - Limited training examples for many classes
  - Cannot predict unseen labelsets

Label Powerset (LP) (4/5)

Labelsets per dataset (Bound = min(m, 2^q); Actual = distinct labelsets; Diversity = Actual/Bound):

Dataset   | m     | q   | Bound | Actual | Diversity
emotions  | 593   | 6   | 64    | 27     | 0.42
enron     | 1702  | 53  | 1702  | 753    | 0.44
hifind    | 32971 | 632 | 32971 | 32734  | 0.99
mediamill | 43907 | 101 | 43907 | 6555   | 0.15
medical   | 978   | 45  | 978   | 94     | 0.10
scene     | 2407  | 6   | 64    | 15     | 0.23
tmc2007   | 28596 | 22  | 28596 | 1341   | 0.05
yeast     | 2417  | 14  | 2417  | 198    | 0.08

Label Powerset (LP) (5/5)

[Figure: log-log plot for mediamill of the number of classes (label combinations) against their number of appearances; e.g. 4104 labelsets appear only once, 895 twice, 384 three times, 215 four, 156 five, 111 six, 85 seven, 77 eight, 48 nine, and 39 ten times.]

Pruned Sets (1/2)
- How it works [Read, NZCSRS08; Read et al., ICDM08]
  - Follows the transformation of LP, but also ...
  - Prunes examples whose labelsets (classes) occur fewer times than a small user-defined threshold p (e.g. 2 or 3): this deals with the large number of infrequent classes
  - Re-introduces pruned examples along with subsets of their labelsets that do occur more than p times
    - Strategy A: rank the subsets by size/number of examples and keep the top b of those
    - Strategy B: keep all subsets of size greater than b

Pruned Sets (2/2)

Example with p = 3:

Labelset       | Count
λ1             | 16
λ2             | 14
{λ2, λ3}       | 12
{λ1, λ4}       | 8
{λ3, λ4}       | 7
{λ1, λ2, λ3}   | 2

The labelset {λ1, λ2, λ3} occurs fewer than p times, so its examples are pruned and re-introduced with frequent subsets; the candidates are {λ2, λ3} (size 2, count 12), {λ1} (size 1, count 16), and {λ2} (size 1, count 14):
- Strategy A with b = 2 keeps the top two ranked subsets
- Strategy B with b = 1 keeps only the subsets of size greater than 1, i.e. {λ2, λ3}

Random k-Labelsets (1/3)
- How it works [Tsoumakas & Vlahavas, ECMLPKDD07]
  - Randomly break a large set of labels into a number (n) of subsets of small size (k), called k-labelsets
  - For each of them, train a multi-label classifier using the LP method
  - Given a new instance, query the models and average their decisions per label; thresholding gives the final prediction
- Benefits
  - Computationally simpler sub-problems
  - More balanced training sets
  - Can predict unseen labelsets

Random k-Labelsets (2/3)

Seven LP models h1..h7 are trained on the 3-labelsets {λ1, λ2, λ6}, {λ2, λ3, λ4}, {λ3, λ5, λ6}, {λ2, λ4, λ5}, {λ1, λ4, λ5}, {λ1, λ2, λ3}, {λ1, λ4, λ6}; each outputs 0/1 predictions for its own labels. Averaging the votes per label gives λ1: 3/4, λ2: 1/4, λ3: 2/3, λ4: 1/4, λ5: 1/3, λ6: 2/3, and thresholding at 0.5 yields the final prediction 1 0 1 0 0 1, i.e. {λ1, λ3, λ6}.
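A minimal Java sketch of the vote-averaging step, here with three hypothetical member models rather than the seven above:

// Minimal sketch: RAkEL-style vote averaging. Each member votes 0/1 only on
// the labels of its own k-labelset; votes are averaged per label and
// thresholded (0.5 here) for the final bipartition.
public class RakelVoting {
    public static void main(String[] args) {
        int q = 6;
        double[] sum = new double[q];
        int[] count = new int[q];
        // each row: a member's k-labelset (0-based) and its 0/1 predictions
        int[][] labelsets = {{0, 1, 5}, {1, 2, 3}, {2, 4, 5}};
        int[][] preds     = {{1, 0, 1}, {0, 1, 0}, {1, 0, 0}};
        for (int h = 0; h < labelsets.length; h++)
            for (int p = 0; p < labelsets[h].length; p++) {
                sum[labelsets[h][p]] += preds[h][p];
                count[labelsets[h][p]]++;
            }
        for (int j = 0; j < q; j++) {
            double avg = count[j] == 0 ? 0 : sum[j] / count[j];
            System.out.printf("λ%d: avg=%.2f -> %d%n", j + 1, avg, avg > 0.5 ? 1 : 0);
        }
    }
}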

Random k-Labelsets (3/3)
- Comments
  - The mean number of votes per label is nk/q; the larger it is, the higher the effectiveness
  - This characterizes RAkEL as an ensemble method
- How to set the parameters k and n?
  - k should be small enough to avoid LP's problems
  - n should be large enough to obtain more votes per label
  - Proposed default parameters: k = 3, n = 2q (6 votes per label)

Ensembles of Pruned Sets
- How it works [Read et al., ICDM08]
  - Constructs n pruned-sets models by sampling the training set (e.g. 63%)
  - Given a new instance, queries the models and averages their decisions (each decision concerns all labels)
  - A ranking is obtained; thresholding is used to obtain the bipartition

Outline
- Introduction
- Overview of existing techniques
  - Problem transformation methods
  - Algorithm adaptation methods
  - From ranking to classification
- Advanced topics
- The Mulan open-source software


Algorithm Adaptation Methods
- Boosting: AdaBoost.MH [Schapire & Singer, MLJ00]
- Generative (Bayesian): [McCallum, AAAI99w], [Ueda & Saito, NIPS03]
- SVM: Rank-SVM [Elisseeff & Weston, NIPS02]
- Decision tree: Multi-label C4.5 [Clare & King, PKDD01]
- Neural network: BP-MLL [Zhang & Zhou, TKDE06]
- Lazy (kNN): ML-kNN [Zhang & Zhou, PRJ07]
- ...

AdaBoost.MH
- Description: the core of BoosTexter, a successful multi-label text categorization system [Schapire & Singer, MLJ00]
- Basic strategy: map the original multi-label learning problem into a binary learning problem, which is then solved by the traditional AdaBoost algorithm [Freund & Schapire, JCSS97]
- Example transformation: transform each multi-label training example (x_i, Y_i) into q binary-labeled examples, one per label y; the new instance is the concatenation of x_i and y, and its binary label is +1 if y ∈ Y_i and -1 otherwise

AdaBoost.MH (Cont’)
- Training procedure: classical AdaBoost is employed to learn from the transformed binary-labeled examples iteratively
- Weak hypotheses (base learners): decision stumps (one-level decision trees); e.g. for a text categorization task, each possible term w (e.g. a bigram) specifies a weak hypothesis that, for a text document x and label y, predicts c_1y if w occurs in x and c_0y otherwise, where c_0y and c_1y are the predicted outputs
- In each boosting round, the choice of weak hypothesis as well as its combination weight is optimized towards minimizing the empirical Hamming loss

Generative (Bayesian) Approach
- Description: model the generative procedure of multi-label texts [McCallum, AAAI99w] [Ueda & Saito, NIPS03]
- Basic assumption: the word distribution given a set of labels is determined by a mixture (linear combination) of word distributions, one for each single label
- Settings: a word vocabulary; P(w|y), the word distribution given a single label y; the word distribution given a set of labels Y, P(w|Y) = Σ_{y∈Y} π_y P(w|y), with a q-dimensional mixture weight vector π for Y

Generative (Bayesian) Approach (Cont’)
- MAP (Maximum A Posteriori) principle: given a test document x*, its associated label set Y* is determined as Y* = argmax_Y P(Y|x*) = argmax_Y P(Y) P(x*|Y) [applying the Bayes rule] = argmax_Y P(Y) Π_{w∈x*} P(w|Y) [assuming word independence]
  - The prior probability P(Y) is directly estimated from the training set by frequency counting
  - P(w|Y) is the mixture of word distributions; the parameters π and P(w|y) are learned by an EM-style procedure
- Note: these two generative approaches are specific to text applications rather than general-purpose multi-label learning methods

Rank-SVM
- Description: a maximum-margin approach to multi-label learning, implemented with the kernel trick to incorporate non-linearity [Elisseeff & Weston, NIPS02]
- Basic strategy: assume one classifier for each individual label, define a "multi-label margin" on the whole training set, and optimize it under a QP (quadratic programming) framework
- Classification system: q linear classifiers f_k(x) = <w_k, x> + b_k, each with weight vector w_k and bias b_k

Rank-SVM (Cont’)
- Margin definition
  - Margin of a multi-label example (x_i, Y_i): labels in Y_i should be ranked higher than labels not in Y_i, i.e. min_{(k,l) ∈ Y_i × Ȳ_i} (<w_k - w_l, x_i> + b_k - b_l) / ||w_k - w_l||
  - Margin on the training set S: the minimum margin over all its examples
- QP formulation (ideal case): maximize the margin subject to correct rankings; solved by introducing slack variables and then optimized in its dual form (with incorporation of the kernel trick)

Multi-Label C4.5
- Description: an extension of the popular C4.5 decision tree to deal with multi-label data [Clare & King, PKDD01]
- Basic strategy: define a "multi-label entropy" over a set of multi-label examples, based on which the information gain of a splitting attribute is calculated; a decision tree is then constructed recursively in the same way as C4.5
- Multi-label entropy: given a set S of multi-label examples, let p(y) denote the probability that an example in S has label y; then
  MLEnt(S) = - Σ_y [ p(y) log p(y) + (1 - p(y)) log(1 - p(y)) ]

BP-MLL
- Description: an extension of popular BP (back-propagation) neural networks to deal with multi-label data [Zhang & Zhou, TKDE06]
- Basic strategy: define a novel global error function capturing the characteristics of multi-label learning, i.e. labels belonging to an example should be ranked higher than those not belonging to that example
- Network architecture
  - Single-hidden-layer feed-forward neural network; adjacent layers are fully connected
  - Each input unit corresponds to a dimension of the input space
  - Each output unit corresponds to an individual label

BP-MLL (Cont’)
- Global error function: given a multi-label training set S, the global training error is E = Σ_i E_i, where the error of the network on (x_i, Y_i) is
  E_i = (1 / (|Y_i| |Ȳ_i|)) Σ_{(j,k) ∈ Y_i × Ȳ_i} exp(-(c_ij - c_ik)),
  with c_ij the actual network output on x_i for the j-th label
- This approximately optimizes the ranking loss criterion: it leads the system to output larger values for the labels belonging to the test instance and smaller values for the labels not belonging to it
- Parameter optimization: gradient descent with the error back-propagation strategy

ML-kNN
- Description: an extension of the popular kNN to deal with multi-label data [Zhang & Zhou, PRJ07]
- Basic strategy: based on statistical information derived from the neighboring examples (a membership counting statistic), the MAP principle is utilized to determine the label set of an unseen example
- Settings
  - N(x): the k nearest neighbors of x identified in the training set
  - C_x: a q-dimensional membership counting vector, whose l-th component counts the number of examples in N(x) having the l-th label
  - H_l^1 (H_l^0): the event that an example has (does not have) the l-th label
  - E_j^l: the event that there are exactly j examples in N(x) having the l-th label

ML-kNN (Cont’)
- Procedure: given a test example x, its associated label set Y is determined as follows
  - Identify its k nearest neighbors N(x) in the training set
  - Compute its membership counting vector C_x based on N(x)
  - Determine each label using the MAP principle: the l-th label is predicted iff P(H_l^1) P(E_{C_x(l)}^l | H_l^1) > P(H_l^0) P(E_{C_x(l)}^l | H_l^0)
- The probabilities needed are directly estimated from the training set by frequency counting
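A minimal Java sketch of the per-label MAP decision, assuming the prior and the likelihood arrays have already been estimated from the training set by frequency counting (all numbers here are illustrative):

// Minimal sketch of the ML-kNN decision rule for one label l, with
// pre-estimated prior P(H_l^1) and likelihoods P(E_j^l | H_l^b), j = 0..k.
public class MlKnnDecision {
    public static void main(String[] args) {
        int k = 5; // number of neighbors; likelihood arrays have k+1 entries
        double priorHas = 0.3;                                  // P(H_l^1)
        double[] likHas = {0.05, 0.10, 0.20, 0.30, 0.20, 0.15}; // P(E_j | H_l^1)
        double[] likNot = {0.40, 0.30, 0.15, 0.10, 0.04, 0.01}; // P(E_j | H_l^0)
        int c = 4; // membership count: neighbors in N(x) having label l
        double scoreHas = priorHas * likHas[c];
        double scoreNot = (1 - priorHas) * likNot[c];
        boolean predict = scoreHas > scoreNot;                  // MAP principle
        System.out.println("P(H1)P(E|H1)=" + scoreHas + " P(H0)P(E|H0)=" + scoreNot
                + " -> label " + (predict ? "present" : "absent"));
    }
}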

Outline
- Introduction
- Overview of existing techniques
  - Problem transformation methods
  - Algorithm adaptation methods
  - From ranking to classification
- Advanced topics
- The Mulan open-source software

From Ranking to Classification
- BR can output numerical scores for each label
  - E.g. perceptrons, SVMs, decision trees, kNN, Bayes
  - An intuitive threshold turns these scores into 0/1 decisions (e.g. 0 for perceptrons and SVMs, 0.5 for probabilistic/confidence outputs)
- The same applies to other problem transformation and algorithm adaptation methods
- Are there general approaches to deliver a classification from a ranking?

From Ranking to Classification
- Thresholding strategies
  - RCut, PCut, SCut, RTCut, SCutFBR [Yang, SIGIR01]
  - A study of SCutFBR [Fan and Lin, TechRep07]
- Learning the number of labels
  - Based on the ranking [Elisseeff and Weston, NIPS02]
  - Based on the content [Tang et al., WWW09]

Thresholding Strategies: RCut
- How it works: given a document, it sorts the labels by score and selects the top k labels, where k ∈ [1, q]
- How to set the parameter k?
  - It can be specified by the user; typically it is set to the label cardinality of the training set
  - It can be globally tuned using a validation set
- Comments
  - What if the number of labels per example varies?
  - It does not perform well in practice [Tang et al., WWW09]
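A minimal Java sketch of RCut (the scores are illustrative):

// Minimal sketch of RCut: sort labels by score and keep the top k.
import java.util.Arrays;
import java.util.Comparator;

public class RCut {
    public static void main(String[] args) {
        double[] scores = {0.9, 0.2, 0.7, 0.4}; // one score per label
        int k = 2;                              // e.g. rounded label cardinality
        Integer[] idx = {0, 1, 2, 3};
        // sort label indices by descending score
        Arrays.sort(idx, Comparator.comparingDouble((Integer j) -> -scores[j]));
        for (int r = 0; r < k; r++)
            System.out.println("select λ" + (idx[r] + 1)); // λ1, λ3
    }
}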

Thresholding Strategies: PCut
- How it works
  - For each label λj, sort the test instances by score and assign λj to the top kj = k·P(λj) test instances
  - P(λj) is the prior probability of a document belonging to λj (estimated on the training set)
  - k is a proportionality constant that trades off false positives against false negatives; it can be globally tuned as in RCut
- Comments: it requires the prediction scores for all test instances, so it is not suitable for online decisions

Thresholding Strategies: SCut
- How it works: for each label λj, tune a separate threshold based on a validation set
- Comments
  - In contrast to RCut and PCut, it tunes a separate parameter for each label
  - Requires a validation phase whose complexity is linear with respect to q
  - Overfits when the binary learning problem is unbalanced (few positive examples), yielding too high or too low thresholds

Thresholding Strategies: RTCut
- How it works
  - Given a document, it sorts the labels by a synthetic score combining rank and score and selects those above a threshold t
  - t is optimized using a validation set
- Synthetic score: ss(λj) = r(λj) + s(λj) / (max_{j∈L} s(λj) + 1)

Thresholding Strategies: SCutFBR
- How it works
  - When the SCut threshold is too high, macro-F1 is hurt
  - When the SCut threshold is too low, both macro- and micro-F1 are hurt (so prefer increasing the threshold)
  - Solution: when the tuned threshold falls below a given value fbr, then ...
    - SCutFBR.0: set the threshold to infinity
    - SCutFBR.1: set the threshold to the largest score observed during validation

Learning the Number of Labels
- [Elisseeff & Weston, NIPS02]
  - Input: a q-dimensional feature space with the obtained scores for each label
  - Output: the threshold t that minimizes the symmetric difference between predicted and true sets
  - Learning: linear least squares
- [Tang et al., WWW09]
  - Input: original feature space, scores, sorted scores
  - Output: the size of the labelset
  - Learning: multi-class classification, with a cut-off parameter

Outline
- Introduction
- Overview of existing techniques
- Advanced topics
  - Learning in the presence of label structure
  - Multi-instance multi-label learning
- The Mulan open-source software


Hierarchy Types and Implications
- Trees
  - Single parent per label
  - When an object is labeled with a node, it is also labeled with its parent (paths)
  - Path types: annotation paths end at a leaf (full paths) or at internal nodes (partial paths)
- Directed acyclic graphs (DAGs)
  - Multiple parents per label
  - When an object is labeled with a node, either it is also labeled with all its parents (multiple inheritance), or with at least one of its parents

A Simple Approach
- Ignore the hierarchy
  - Simple binary relevance
  - Should be used as a baseline
  - In the case of full paths, learn leaf models only!

Hierarchical Binary Relevance
- Training [Koller & Sahami, ICML97; Cesa-Bianchi et al., JMLR06]
  - One binary model at each node, using only those examples that are annotated with the parent node
- Predictions are formed top-down
  - A node can predict true only if its parent predicted true
  - What about probabilities? p(λ) = p(λ | par(λ)) · p(par(λ))
  - When thresholding, the threshold for a node should not be higher than that of its parent
- Comments: handles both partial and full paths
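A minimal Java sketch of the probability chaining on a hypothetical three-node path (the taxonomy and the conditional probabilities are made up for illustration):

// Minimal sketch: top-down probability chaining in hierarchical BR,
// p(λ) = p(λ | par(λ)) * p(par(λ)), on a tiny hypothetical tree.
import java.util.LinkedHashMap;
import java.util.Map;

public class TopDownHbr {
    public static void main(String[] args) {
        // parent of each node (null = child of the root); parents listed first
        Map<String, String> parent = new LinkedHashMap<>();
        parent.put("science", null);
        parent.put("physics", "science");
        parent.put("optics", "physics");
        // conditional probabilities p(λ | par(λ)) from the per-node models
        Map<String, Double> cond = Map.of("science", 0.9, "physics", 0.8, "optics", 0.4);
        Map<String, Double> prob = new LinkedHashMap<>();
        for (String node : parent.keySet()) {
            String par = parent.get(node);
            double pPar = par == null ? 1.0 : prob.get(par);
            prob.put(node, cond.get(node) * pPar);
        }
        System.out.println(prob); // science=0.9, physics≈0.72, optics≈0.29
    }
}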

Hierarchical Multi-Label Learning
- Generalization of HBR [Tsoumakas et al., ECMLPKDD08w]
  - Training and testing follow the same approach as HBR
  - One multi-label learner is trained at each internal node
  - If BR is used at each node, we get HBR
- TreeBoost.MH [Esuli et al., IR08]: an instantiation using AdaBoost.MH

Other Approaches
- Predictive Clustering Trees [Vens et al., MLJ08]
- B-SVM [Cesa-Bianchi et al., ICML06]
  - Train similarly to HBR
  - Bottom-up Bayesian combination of probabilities
- Bayesian networks [Barutcuoglu et al., BIOINF06]
  - Train independent binary classifiers for each label
  - Combine them using a Bayesian network to correct inconsistencies

Outline
- Introduction
- Overview of existing techniques
- Advanced topics
  - Learning in the presence of label structure
  - Multi-instance multi-label learning
- The Mulan open-source software

Consider ...
- An image usually contains multiple regions, each of which can be represented by an instance; the image can simultaneously belong to multiple classes: Elephant, Lion, Grassland, Tropic, Africa, ...

Consider ...
- A document usually contains multiple sections, each of which can be represented by an instance; the document can simultaneously belong to multiple categories: scientific novel, Jules Verne's writing, book on traveling, ...

MIML: Multi-Instance Multi-Label Learning

Why MIML?
- An appropriate representation is important: having an appropriate representation is as important as having a strong learning algorithm
- Real-world objects usually come with input ambiguity as well as output ambiguity
- Traditional supervised learning, multi-instance learning, and multi-label learning are degenerate versions of MIML

Why MIML? (Cont’)

[Figure: the four learning frameworks: traditional supervised learning (one instance, one label), multi-label learning (one instance, multiple labels), multi-instance learning (multiple instances, one label), and multi-instance multi-label learning (multiple instances, multiple labels).]

Why MIML? (Cont’)
- Learning a one-to-many mapping is an ill-posed problem: why are there multiple labels? A many-to-many mapping seems better
- Moreover, MIML offers a possibility for understanding the relationship between instances and labels: an object is represented by multiple instances corresponding to its different aspects, and each instance may account for some of the labels

Why MIML? (Cont’)
- MIML can also be helpful for learning single-label examples involving complicated high-level concepts


Multi-Instance Multi-Label Learning
- MIML task: learn a function from a given dataset {(X_i, Y_i)}, where X_i ⊆ Ξ is a set of instances {x_i1, ..., x_in_i} and Y_i ⊆ Ψ is a set of labels {y_i1, ..., y_il_i}
  - Ξ: the instance space; Ψ: the set of class labels
  - n_i: the number of instances in X_i; l_i: the number of labels in Y_i

Solving MIML by Degeneration
- Solution 1: degenerate MIML into multi-instance learning (MIL) via category-wise decomposition, and then MIL into single-instance single-label learning (SISL); MIMLBoost (built on MIBoosting) is an illustration of Solution 1
- Solution 2: degenerate MIML into multi-label learning (MLL) via representation transformation (turning ambiguous bags of instances into unambiguous single-instance representations), and then MLL into SISL; MIMLSVM (built on MLSVM) is an illustration of Solution 2
- Both may suffer from information loss during the degeneration process!

Advanced Topics
- Solving MIML by regularization; no access to original data objects; learning single-label examples involving complicated high-level concepts: Zhou et al., MIML: A Framework for Learning with Ambiguous Objects, CoRR abs/0808.3231, 2008
- Large-margin MIML algorithm M3MIML [Zhang & Zhou, ICDM'08]
- MIML for image annotation [Zha et al., CVPR'08]
- MIML metric learning [Jin et al., CVPR'09]
- ...

Outline
- Introduction
- Overview of existing techniques
- Advanced topics
- The Mulan open-source software

What It Is
- Mulan: an open-source software for multi-label learning; current version: 1.0.1
- Open-source software => scientific progress: "better reproducibility of experimental results, quicker detection of errors, innovative applications, and faster adoption of machine learning methods in other disciplines and in industry" (JMLR OSS)
- Built on top of Weka: well-established code and user base
- Programming language: Java

Data Format
- ARFF file:

@relation MultiLabelExample

@attribute feature1 numeric
@attribute feature2 numeric
@attribute feature3 numeric
@attribute label1 {0, 1}
@attribute label2 {0, 1}
@attribute label3 {0, 1}
@attribute label4 {0, 1}
@attribute label5 {0, 1}

@data
2.3,5.6,1.4,0,1,1,0,0

- XML file: declares which attributes are labels (and, optionally, their hierarchy); see the sketch below
- Datasets available at: http://mlkd.csd.auth.gr/multilabel.html
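The slide's XML example did not survive extraction; a minimal flat labels file matching the ARFF above would look roughly as follows (the namespace follows the Mulan distribution, but the exact schema should be checked against LabelsBuilder's documentation):

<labels xmlns="http://mulan.sourceforge.net/labels">
  <label name="label1"></label>
  <label name="label2"></label>
  <label name="label3"></label>
  <label name="label4"></label>
  <label name="label5"></label>
</labels>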

Hierarchies of Labels

[Figure: an example of a label hierarchy declared in the XML file.]

Validity Checks
- All labels specified in the XML file must also be defined in the ARFF file with the same name
- Label names must be unique
- Each ARFF label attribute must be nominal with binary values {0, 1}
- Data should be consistent with the hierarchy: if a child label appears in an example, then all its parent labels must also appear

mulan.core
- Exception handling infrastructure: WekaException, MulanException, MulanRuntimeException
- Util: utilities
- MulanJavadoc: documentation support

mulan.data (1/2)
- MultiLabelInstances
- LabelsMetaData and LabelsMetaDataImpl
- LabelNode and LabelNodeImpl
- LabelsBuilder: creates a LabelsMetaData instance from an XML file
- LabelSet: labelset representation class

mulan.data (2/2)
- Converters from LibSVM and CLUS formats: ConverterLibSVM, ConverterCLUS
- Statistics
  - Number of numeric/nominal attributes, labels
  - Cardinality, density, distinct labelsets
  - Phi-correlation matrix, co-occurrence matrix

MultiLabelInstances
- Fields: Instances dataSet; LabelsMetaData labelsMetaData
- Constructors: MultiLabelInstances(String arffFile, String xmlFile); MultiLabelInstances(String arffFile, int numLabels); MultiLabelInstances(Instances d, LabelsMetaData m)
- Methods: getDataset(): Instances; getLabelsMetaData(): LabelsMetaData; getLabelIndices(): int[]; getFeatureIndices(): int[]; getNumLabels(): int; clone(): MultiLabelInstances; reintegrateModifiedDataSet(Instances d): MultiLabelInstances

LabelsMetaData
- LabelsMetaDataImpl
  - Fields: Map allLabelNodes; Set rootLabelNodes
  - Methods: getRootLabels(): Set; getLabelNames(): Set; getLabelNode(String labelName): LabelNode; getNumLabels(): int; containsLabel(String labelName): boolean; isHierarchy(): boolean; addRootNode(LabelNode rootNode); removeLabelNode(String labelName): int

LabelNode
- LabelNodeImpl
  - Fields: Set childrenNodes; String name; LabelNode parentNode
  - Methods: getChildren(): Set; getChildredLabels(): Set; getDescendantLabels(): Set; getName(): String; getParent(): LabelNode; hasChildren(): boolean; hasParent(): boolean; addChildNode(LabelNode node): boolean; removeChildNode(LabelNode node): boolean

mulan.transformation
- Package mulan.transformation
  - Includes data transformation approaches, common for learning and feature selection
  - BinaryRelevance, LabelPowerset, mulan.transformation.multiclass
  - Simple transformations that support complex ones: RemoveAllLabels

mulan.transformation.multiclass
- MultiClassTransformation: transformInstances(MultiLabelInstances d): Instances
- MultiClassTransformationBase: transformInstances(MultiLabelInstances d): Instances; transformInstance(Instance i): List
- Implementations: Copy, CopyWeight, Ignore, SelectRandom, SelectBasedOnFrequency, SelectionType

mulan.classifier (1/3)
- MultiLabelLearner
- MultiLabelLearnerBase
- MultiLabelOutput
- Sub-packages: mulan.classifier.transformation, mulan.classifier.meta, mulan.classifier.lazy, mulan.classifier.neural

mulan.classifier (2/3)
- MultiLabelLearner: build(MultiLabelInstances instances); makePrediction(Instance instance): MultiLabelOutput; makeCopy(): MultiLabelLearner
- MultiLabelLearnerBase
  - Fields: int numLabels; int[] labelIndices, featureIndices; boolean isDebug
  - Methods: build(MultiLabelInstances instances); buildInternal(MultiLabelInstances instances); makeCopy(); debug(String msg); setDebug(boolean debug); getDebug(): boolean

mulan.classifier (3/3)
- MultiLabelOutput: boolean[] bipartition; double[] confidences; int[] ranking

mulan.classifier.transformation
- TransformationBasedMultiLabelLearner
  - Field: Classifier baseClassifier
  - Constructors: TransformationBasedMultiLabelLearner(); TransformationBasedMultiLabelLearner(Classifier base)
  - getBaseClassifier(): Classifier
- BinaryRelevance
- CalibratedLabelRanking: boolean useStandardVoting
- LabelPowerset, PPT, MultiLabelStacking, MultiClassLearner

mulan.classifier.meta
- MultiLabelMetaLearner
  - Field: MultiLabelLearner baseLearner
  - Constructors: MultiLabelMetaLearner(); MultiLabelMetaLearner(MultiLabelLearner base)
  - getBaseLearner(): MultiLabelLearner
- RAkEL, HOMER, HMC

Lazy and Neural Methods
- Package mulan.classifier.lazy
  - MultiLabelKNN: abstract base class for kNN-based methods
  - Implemented algorithms: MLkNN [Matlab version at http://lamda.nju.edu.cn/datacode/MLkNN.htm], BRkNN, IBLR_ML [Cheng & Hullermeier, MLJ09]
- Package mulan.classifier.neural
  - BPMLL [Matlab version at http://lamda.nju.edu.cn/datacode/BPMLL.htm]
  - Package mulan.classifier.neural.model: support classes

Evaluation (1/6)
- Evaluator
  - Constructors: Evaluator(); Evaluator(int seed)
  - evaluate(MultiLabelLearner learner, MultiLabelInstances test): Evaluation
  - crossValidate(MultiLabelLearner learner, MultiLabelInstances test, int numFolds): Evaluation

Evaluation (2/6)
- Evaluation
  - getExampleBasedMeasures(): ExampleBasedMeasures
  - getLabelBasedMeasures(): LabelBasedMeasures
  - getConfidenceLabelBasedMeasures(): ConfidenceLabelBasedMeasures
  - getRankingBasedMeasures(): RankingBasedMeasures
  - getHierarchicalMeasures(): HierarchicalMeasures
  - toString(); toCSV()

Evaluation (3/6)
- ExampleBasedMeasures
  - Constructors: ExampleBasedMeasures(MultiLabelOutput[] output, boolean[][] trueLabels, double forgivenessRate); ExampleBasedMeasures(MultiLabelOutput[] output, boolean[][] trueLabels); ExampleBasedMeasures(ExampleBasedMeasures[] array)
  - getSubsetAccuracy(): double; getAccuracy(): double; getHammingLoss(): double; getPrecision(): double; getRecall(): double; getFMeasure(): double

Evaluation (4/6)
- LabelBasedMeasures
  - Constructors: LabelBasedMeasures(MultiLabelOutput[] output, boolean[][] trueLabels); LabelBasedMeasures(LabelBasedMeasures[] array)
  - Per label: getLabelAccuracy(int label): double; getLabelPrecision(int label): double; getLabelRecall(int label): double; getLabelFMeasure(int label): double
  - Averaged: getAccuracy(Averaging type): double; getPrecision(Averaging type): double; getRecall(Averaging type): double; getFMeasure(Averaging type): double
- Averaging: MACRO, MICRO

Evaluation (5/6)
- RankingBasedMeasures
  - Constructors: RankingBasedMeasures(MultiLabelOutput[] output, boolean[][] trueLabels); RankingBasedMeasures(ExampleBasedMeasures[] array)
  - getAvgPrecision(): double; getOneError(): double; getRankingLoss(): double; getCoverage(): double

Evaluation (6/6)
- ConfidenceLabelBasedMeasures
  - Constructors: ConfidenceLabelBasedMeasures(MultiLabelOutput[] output, boolean[][] trueLabels); ConfidenceLabelBasedMeasures(ConfidenceLabelBasedMeasures[] array)
  - getLabelAUC(int label): double; getAUC(Averaging type): double
- HierarchicalMeasures
  - Constructors: HierarchicalMeasures(MultiLabelOutput[] output, boolean[][] trueLabels, LabelsMetaData metaData); HierarchicalMeasures(HierarchicalMeasures[] array)
  - getHierarchicalLoss(): double

Examples
- Package mulan.examples: TrainTestExperiment, CrossValidationExperiment, EstimationOfStatistics, GettingPredictionsOnTestSet, AttributeSelectionTest

An Example

// Imports assumed for Mulan 1.0.x on top of Weka: the slides place
// MultiLabelInstances in mulan.data and BinaryRelevance in
// mulan.classifier.transformation, while the packages of Evaluator and
// Evaluation are an assumption here (check the distribution's Javadoc).
import mulan.classifier.transformation.BinaryRelevance;
import mulan.data.MultiLabelInstances;
import mulan.evaluation.Evaluation;
import mulan.evaluation.Evaluator;
import weka.classifiers.Classifier;
import weka.classifiers.bayes.NaiveBayes;

MultiLabelInstances train, test;
train = new MultiLabelInstances("yeast-train.arff", "yeast.xml");
test = new MultiLabelInstances("yeast-test.arff", "yeast.xml");

Classifier base = new NaiveBayes();
BinaryRelevance br = new BinaryRelevance(base);
br.build(train);

Evaluator eval = new Evaluator();
Evaluation results = eval.evaluate(br, test);
System.out.println(results.toString());

Mulan Credits
- Main co-developers: Robert Friberg, Lefteris Spyromitros, Jozef Vilcek
- Code contributors: Stavros Bakirtzoglou, Weiwei Cheng, Ioannis Katakis, Sang-Hyeun Park, Elise Rairat, George Saridis, George Traianos

Thank you!

Bibliography
- An online multi-label learning bibliography is maintained at http://www.citeulike.org/group/7105/tag/multilabel and currently includes more than 90 articles
- You can ...
  - Grab BibTeX and RIS records
  - Subscribe to the corresponding RSS feed
  - Follow links to the papers' full PDFs (may require access to digital libraries)
  - Export the complete bibliography for BibTeX or EndNote use (requires a CiteULike account)