Fuzzy Combinations of Criteria: An Application to Web Page Representation for Clustering
Alberto Pérez García-Plaza, Víctor Fresno, Raquel Martínez
NLP & IR Group, Distance Learning University (UNED)
CICLing 2012, New Delhi, India
March 23, 2012

Motivation

Understanding the system

Improving the Combination

Summary

Table of Contents

1 Motivation
  Web Page Representation
  Linear Combination of Criteria
  Nonlinear Combination of Criteria
2 Understanding the system
  Experimental Settings
  Dimension Reduction Analysis
  Study of Individual Criteria
3 Improving the Combination
4 Summary

Motivation

Main goal: to understand how to represent web pages for clustering.

Question: how should different page features be combined to represent web pages?

Web Page Representation

Hypothesis: a good document representation should be based on how humans read documents.

Different Criteria for Web Page Representation

Criteria: Title, Emphasis

Word positions: Preferential, Standard

Linear Combination of Criteria

For example: Analytical Combination of Criteria (acc) [1]. Importance of a term in a document:

$I_k = t_k i_t + e_k i_e + f_k i_f + p_k i_p$   (1)

$I_k = 1 \cdot 0.4 + 0.6 \cdot 0.3 + 0 \cdot 0.2 + 0 \cdot 0.1 = 0.58$   (2)

Drawback: the importance of a term in one component is calculated regardless of the other components.

[1] V. Fresno and A. Ribeiro. An analytical approach to concept extraction in HTML environments. J. Intell. Inf. Syst., 22(3):215–235, 2004.
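
The weighted sum of Eq. (1) can be sketched in a few lines of Python. This is only an illustration: reading t_k, e_k, f_k, p_k as the title, emphasis, frequency and position values of term k, and i_t, i_e, i_f, i_p as the corresponding weights, is an assumption inferred from the numbers plugged into Eq. (2). The names `CriteriaValues`, `WEIGHTS` and `acc_importance` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CriteriaValues:
    """One value per criterion, each assumed to lie in [0, 1]."""
    title: float
    emphasis: float
    frequency: float
    position: float


# Assumed weights: the second factor of each product in Eq. (2).
WEIGHTS = CriteriaValues(title=0.4, emphasis=0.3, frequency=0.2, position=0.1)


def acc_importance(term: CriteriaValues,
                   weights: CriteriaValues = WEIGHTS) -> float:
    """Weighted sum of Eq. (1): every criterion contributes independently."""
    return (term.title * weights.title
            + term.emphasis * weights.emphasis
            + term.frequency * weights.frequency
            + term.position * weights.position)


# First factors of Eq. (2): t_k = 1, e_k = 0.6, f_k = 0, p_k = 0.
example = CriteriaValues(title=1.0, emphasis=0.6, frequency=0.0, position=0.0)
importance = acc_importance(example)
print(round(importance, 2))
```

Because the sum is linear, a term that only appears in the title still receives the full title share of the score, whatever the other criteria say.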

Example: acc

Example of a rhetorical title: “Call to Arms” is the title of a page containing an article about the new trades made by the New York Yankees baseball team and how these trades affect the Boston Red Sox, their main rival in Major League Baseball.

Drawback: the title terms are not related to the document topic.

Nonlinear Combination of Criteria

Fuzzy Combination of Criteria (fcc) [2] allows nonlinear combinations of criteria.
It is possible to define related conditions.
It produces vectors within the vector space model (VSM).

[2] A. Ribeiro, V. Fresno, M. C. Garcia-Alegre, and D. Guinea. A fuzzy system for the web page representation. 2003.

Example: fcc

Example of a rhetorical title: now we can express that a term should both appear in the title and be emphasized to be considered important.

Nonlinearity: title terms can be considered unimportant because they do not appear in the rest of the text.
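
The difference between a linear and a conjunctive combination can be illustrated with a minimal sketch. It assumes the fuzzy AND is modelled with the minimum operator, a common choice that the slides do not confirm for fcc; the equal weights and the input degrees are made up for illustration, and both function names are hypothetical.

```python
def linear_importance(title: float, emphasis: float,
                      w_title: float = 0.5, w_emphasis: float = 0.5) -> float:
    """Linear combination: each criterion adds its share independently."""
    return w_title * title + w_emphasis * emphasis


def conjunctive_importance(title: float, emphasis: float) -> float:
    """Fuzzy AND via the minimum: both criteria must hold to score high."""
    return min(title, emphasis)


# A rhetorical title term: present in the title, never emphasized in the text.
title_only = {"title": 1.0, "emphasis": 0.0}
# A term that is in the title and also emphasized in the text.
title_and_emphasis = {"title": 1.0, "emphasis": 0.8}

for name, term in [("title only", title_only),
                   ("title and emphasis", title_and_emphasis)]:
    print(name, linear_importance(**term), conjunctive_importance(**term))
```

With the linear form, the title-only term keeps half of the maximum score; with the conjunctive form it drops to zero, which is the behaviour wanted for rhetorical titles.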

A quick glance at fcc

Close to natural language.
Knowledge base: defined by a set of IF-THEN rules.
Rules are based on how humans read documents.

Basic Clustering Settings

Preprocessing: we remove stopwords, punctuation and suffixes (Porter’s algorithm).
Clustering: Cluto-rbr with default parameters.
Web page representations: tf-idf and fcc.
Dimension reduction techniques (100, 500, 1000, 2000 and 5000 features): mft and lsi.
Datasets: Banksearch and Webkb.
Evaluation: F-measure to evaluate clustering quality.
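
As an illustration of the evaluation step, the sketch below computes one widely used clustering F-measure: for each reference class, take the best F1 it reaches against any cluster, then average over classes weighted by class size. The slides do not say which F-measure variant was used, so this definition is an assumption, and the function name `clustering_f_measure` is hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def clustering_f_measure(classes: Sequence[Hashable],
                         clusters: Sequence[Hashable]) -> float:
    """Class-size-weighted average of the best F1 each class reaches."""
    if len(classes) != len(clusters):
        raise ValueError("classes and clusters must have the same length")
    n = len(classes)
    if n == 0:
        raise ValueError("at least one document is required")

    class_sizes = Counter(classes)
    cluster_sizes = Counter(clusters)
    overlap = Counter(zip(classes, clusters))  # documents shared by (class, cluster)

    total = 0.0
    for cls, n_cls in class_sizes.items():
        best = 0.0
        for clu, n_clu in cluster_sizes.items():
            n_both = overlap[(cls, clu)]
            if n_both == 0:
                continue
            precision = n_both / n_clu
            recall = n_both / n_cls
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_cls / n) * best
    return total


# Two reference classes, first recovered exactly, then merged into one cluster.
labels = ["a", "a", "b", "b"]
print(clustering_f_measure(labels, [0, 0, 1, 1]))
print(clustering_f_measure(labels, [0, 0, 0, 0]))
```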

Dimension Reduction Analysis

Hypothesis: if lsi improves on mft, then the weighting function is not able to find the most representative terms.

Rep.          Avg.    S.D.
Banksearch
  tf-idf mft  0.748   0.028
  tf-idf lsi  0.756   0.005
  fcc mft     0.756   0.019
  fcc lsi     0.769   0.011
Webkb
  tf-idf mft  0.460   0.051
  tf-idf lsi  0.507   0.006
  fcc mft     0.469   0.009
  fcc lsi     0.466   0.011

Conclusion: the weighting function is not working as well as it could.

Conclusion: results for fcc on the Webkb dataset are surprisingly bad.

Results for Criteria Analysis

Banksearch

Rep.\Dim.    100     500     1000    2000    5000
fcc mft      0.723   0.757   0.768   0.765   0.768
title        0.626   0.646   0.632   0.634   0.639
emphasis     0.586   0.671   0.674   0.685   0.693
frequency    0.689   0.715   0.720   0.724   0.731
position     0.310   0.525   0.538   0.599   0.608

For Banksearch, fcc always gets higher values than the individual criteria, so the combination works better in all cases. Frequency seems to be the best among the individual criteria.

Results for Criteria Analysis

Webkb

Rep.\Dim.    100     500     1000    2000    5000
fcc mft      0.453   0.472   0.475   0.468   0.475
title        0.432   0.433   0.404   0.488   0.479
emphasis     0.415   0.431   0.433   0.465   0.489
frequency    0.441   0.460   0.460   0.468   0.446
position     0.301   0.283   0.317   0.281   0.286

For Webkb, fcc does not always outperform the others. Frequency is not always the best among the individual criteria. When title and emphasis could lead to a better clustering, the combination gets worse.
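
The Webkb observations can be checked directly against the figures reported on this slide. The short sketch below only restates those values and picks the best representation per dimension; the helper name `best_per_dimension` is hypothetical.

```python
DIMENSIONS = [100, 500, 1000, 2000, 5000]

# Webkb scores per representation, one value per dimension, as reported above.
WEBKB = {
    "fcc mft":   [0.453, 0.472, 0.475, 0.468, 0.475],
    "title":     [0.432, 0.433, 0.404, 0.488, 0.479],
    "emphasis":  [0.415, 0.431, 0.433, 0.465, 0.489],
    "frequency": [0.441, 0.460, 0.460, 0.468, 0.446],
    "position":  [0.301, 0.283, 0.317, 0.281, 0.286],
}


def best_per_dimension(scores):
    """Return {dimension: (representation, score)} for the top score per column."""
    winners = {}
    for column, dim in enumerate(DIMENSIONS):
        name = max(scores, key=lambda rep: scores[rep][column])
        winners[dim] = (name, scores[name][column])
    return winners


winners = best_per_dimension(WEBKB)
for dim, (name, score) in winners.items():
    print(dim, name, score)
```

At 2000 and 5000 features an individual criterion (title, then emphasis) beats the combination, which is the case the slide singles out.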

Improving the Combination

Frequency should influence the decision more than position.

IF Title   AND Frequency   AND Emphasis   AND Position      THEN   Importance
   Low         Medium          Low            Preferential   ⇒     Low
   Low         Medium          Low            Standard       ⇒     No
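
To see why position dominates in these two rules, the sketch below fires them on one hypothetical term. It assumes a Mamdani-style reading in which AND is the minimum and equal consequents are merged with the maximum; the slides do not state which operators or membership functions fcc uses, and the membership degrees are invented for illustration.

```python
from typing import Dict

# The two rules shown above: (antecedent label per criterion, consequent).
RULES = [
    ({"title": "Low", "frequency": "Medium", "emphasis": "Low",
      "position": "Preferential"}, "Low"),
    ({"title": "Low", "frequency": "Medium", "emphasis": "Low",
      "position": "Standard"}, "No"),
]

Memberships = Dict[str, Dict[str, float]]


def firing_strengths(memberships: Memberships) -> Dict[str, float]:
    """Fire every rule with AND = min; merge equal consequents with max."""
    strengths: Dict[str, float] = {}
    for antecedent, consequent in RULES:
        degree = min(memberships[criterion].get(label, 0.0)
                     for criterion, label in antecedent.items())
        strengths[consequent] = max(strengths.get(consequent, 0.0), degree)
    return strengths


# Hypothetical fuzzified inputs for one term (degrees in [0, 1]).
term = {
    "title": {"Low": 1.0, "High": 0.0},
    "frequency": {"Low": 0.2, "Medium": 0.8, "High": 0.0},
    "emphasis": {"Low": 0.9, "Medium": 0.1, "High": 0.0},
    "position": {"Preferential": 0.3, "Standard": 0.7},
}
result = firing_strengths(term)
print(result)
```

The frequency degree is identical in both rules, so the only thing separating "Low" from "No" importance is the position membership.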

Extended Fuzzy Combination of Criteria (efcc)

IF Title   AND Frequency   AND Emphasis   AND Position      THEN   Importance
   High        -               High           -              ⇒     Very High
   High        -               Medium         Preferential   ⇒     High
   High        -               Medium         Standard       ⇒     Medium
   High        -               Low            Preferential   ⇒     Medium
   High        -               Low            Standard       ⇒     Low
   Low         -               High           Preferential   ⇒     High
   Low         -               High           Standard       ⇒     Medium
   Low         -               Medium         Preferential   ⇒     Medium
   Low         -               Medium         Standard       ⇒     Low
   Low         -               Low            Preferential   ⇒     Low
   Low         -               Low            Standard       ⇒     No
   -           High            -              -              ⇒     Very High
   -           Medium          -              -              ⇒     Medium
   -           Low             -              -              ⇒     No

System Comparison

With efcc, both reduction methods get similar results.

Rep.          Avg.    S.D.
Banksearch
  tf-idf lsi  0.756   0.005
  fcc lsi     0.769   0.011
  efcc mft    0.760   0.014
  efcc lsi    0.758   0.013
Webkb
  tf-idf lsi  0.507   0.006
  fcc mft     0.469   0.009
  efcc mft    0.532   0.032
  efcc lsi    0.483   0.000

efcc solves the problems of fcc in Webkb.

Summary

We present a term weighting function based on how humans read documents.
The representation is not oriented to concrete sets of web pages.
Nonlinear systems help express relations among criteria.
With a good term weighting function it is possible to use lightweight dimension reduction techniques.
Our system tries to ease communication between technical and linguistic experts.
Anchor texts were also studied as a way of adding contextual information.

Thank You!
