Fuzzy Combinations of Criteria: An Application to Web Page Representation for Clustering
Alberto Pérez García-Plaza, Víctor Fresno, Raquel Martínez
NLP & IR Group, Distance Learning University (UNED)
CICLing 2012, New Delhi, India
March 23, 2012
Motivation
Understanding the system
Improving the Combination
Summary
Table of Contents
1 Motivation
  Web Page Representation
  Linear Combination of Criteria
  Nonlinear Combination of Criteria
2 Understanding the system
  Experimental Settings
  Dimension Reduction Analysis
  Study of Individual Criteria
3 Improving the Combination
4 Summary
Motivation
Main goal: To understand how to represent web pages for clustering.
Question: How should different page features be combined to represent web pages?
Web Page Representation
Hypothesis: A good document representation should be based on how humans read documents.
Different Criteria for Web Page Representation
Criteria: Title, Emphasis
Word positions: Preferential, Standard
Linear Combination of Criteria
For example: Analytical Combination of Criteria (acc) [1].
Importance of a term in a document:

$$I_k = t_k\, i_t + e_k\, i_e + f_k\, i_f + p_k\, i_p \tag{1}$$

$$I_k = 1 \cdot 0.4 + 0.6 \cdot 0.3 + 0 \cdot 0.2 + 0 \cdot 0.1 = 0.58 \tag{2}$$

Drawback: The importance of a term in one component is calculated regardless of the other components.

[1] V. Fresno and A. Ribeiro. An analytical approach to concept extraction in HTML environments. J. Intell. Inf. Syst., 22(3):215–235, 2004.
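As a rough illustration of the linear combination in Eq. (1), the following Python sketch computes the importance of a term as a weighted sum. The function name `acc_importance` and the reading of the four term values as per-criterion scores (title, emphasis, frequency, position) are assumptions made for this example; the default weights are the coefficients that appear in the numeric example of Eq. (2).

```python
def acc_importance(t, e, f, p, weights=(0.4, 0.3, 0.2, 0.1)):
    """Weighted sum of the four criterion scores of one term (Eq. 1).

    t, e, f, p: title, emphasis, frequency and position scores of the term
                (assumed interpretation of t_k, e_k, f_k, p_k).
    weights:    coefficients i_t, i_e, i_f, i_p taken from Eq. (2).
    """
    i_t, i_e, i_f, i_p = weights
    return t * i_t + e * i_e + f * i_f + p * i_p


# Term values used in the numeric example of Eq. (2).
example = acc_importance(t=1.0, e=0.6, f=0.0, p=0.0)
print(round(example, 2))

# Hypothetical term that appears only in the title.
title_only = acc_importance(t=1.0, e=0.0, f=0.0, p=0.0)
print(round(title_only, 2))
```

Because each criterion contributes independently, a term that appears only in the title still receives its full title contribution, whatever the other criteria say about it.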
Example: acc
Example of a rhetorical title: “Call to Arms” is the title of a page containing an article about the new trades made by the New York Yankees baseball team and how these trades affect the Boston Red Sox, their main rival in Major League Baseball.
Drawback: Title terms are not related to the document topic.
Nonlinear Combination of Criteria
Fuzzy Combination of Criteria (fcc) [2] allows nonlinear combinations of criteria.
It makes it possible to define conditions that relate criteria to one another.
It produces vectors within the vector space model (VSM).

[2] A. Ribeiro, V. Fresno, M. C. Garcia-Alegre, and D. Guinea. A fuzzy system for the web page representation. 2003.
Example: fcc
Example of a rhetorical title: Now we can express that a term must both appear in the title and be emphasized to be considered important.
Nonlinearity: Title terms can be considered unimportant because they do not appear in the rest of the text.
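The effect of such a conjunctive condition can be sketched in a few lines of Python. The sketch assumes the minimum operator as fuzzy AND, which is a common choice in fuzzy rule systems but is not stated in the slides; the function names and the membership degrees are hypothetical.

```python
def fuzzy_and(*degrees):
    """Fuzzy conjunction using the minimum operator (assumed, not from the slides)."""
    return min(degrees)


def rule_title_and_emphasis(title_high, emphasis_high):
    """Firing strength of: IF title is High AND emphasis is High THEN importance is High."""
    return fuzzy_and(title_high, emphasis_high)


# Hypothetical degrees: a rhetorical title term that is never emphasized in the body.
rhetorical = rule_title_and_emphasis(title_high=1.0, emphasis_high=0.0)

# Hypothetical degrees: a title term that is also strongly emphasized.
topical = rule_title_and_emphasis(title_high=1.0, emphasis_high=0.8)

print(rhetorical, topical)
```

Unlike a weighted sum, the rule does not fire at all for the rhetorical term: a high title degree cannot compensate for a missing second condition.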
A quick glance at fcc
Close to natural language.
Knowledge base: defined by a set of IF-THEN rules.
Rules are based on how humans read documents.
Basic Clustering Settings
Preprocessing: we remove stopwords and punctuation, and strip suffixes (Porter’s algorithm).
Clustering: Cluto-rbr with default parameters.
Web page representations: tf-idf and fcc.
Dimension reduction techniques (100, 500, 1000, 2000 and 5000 features): mft and lsi.
Datasets: Banksearch and Webkb.
Evaluation: F-measure to assess clustering quality.
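The slides do not say which variant of the F-measure is used. As a minimal sketch, the Python function below implements one formulation that is common in the clustering literature: each class is matched with the cluster that gives it the best F value, and the per-class values are averaged weighted by class size. The function name and the toy labels are hypothetical.

```python
from collections import Counter


def clustering_f_measure(labels, clusters):
    """Class-size-weighted average of the best F value each class reaches in any cluster."""
    n = len(labels)
    class_sizes = Counter(labels)
    cluster_sizes = Counter(clusters)
    joint = Counter(zip(labels, clusters))  # documents shared by a class and a cluster

    total = 0.0
    for cls, n_i in class_sizes.items():
        best = 0.0
        for clu, n_j in cluster_sizes.items():
            n_ij = joint[(cls, clu)]
            if n_ij == 0:
                continue
            precision = n_ij / n_j
            recall = n_ij / n_i
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_i / n) * best
    return total


# Toy gold labels and two candidate clusterings (hypothetical data).
labels = ["bank", "bank", "sport", "sport"]
perfect = clustering_f_measure(labels, [0, 0, 1, 1])
merged = clustering_f_measure(labels, [0, 0, 0, 0])
print(perfect, round(merged, 3))
```

A clustering that reproduces the classes exactly scores 1, while merging everything into a single cluster keeps recall at 1 but lowers precision, so the score drops.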
Dimension Reduction Analysis
Hypothesis: If lsi improves on mft, then the weighting function is not able to find the most representative terms.
Rep.            Avg.     S.D.
Banksearch
  tf-idf mft    0.748    0.028
  tf-idf lsi    0.756    0.005
  fcc mft       0.756    0.019
  fcc lsi       0.769    0.011
Webkb
  tf-idf mft    0.460    0.051
  tf-idf lsi    0.507    0.006
  fcc mft       0.469    0.009
  fcc lsi       0.466    0.011

Conclusion: The weighting function is not working as well as it could.
Conclusion: Results for fcc on the Webkb dataset are surprisingly bad.
Results for Criteria Analysis
Banksearch
Rep.\Dim.     100      500      1000     2000     5000
fcc mft       0.723    0.757    0.768    0.765    0.768
title         0.626    0.646    0.632    0.634    0.639
emphasis      0.586    0.671    0.674    0.685    0.693
frequency     0.689    0.715    0.720    0.724    0.731
position      0.310    0.525    0.538    0.599    0.608

For Banksearch, fcc always obtains higher values than the individual criteria, so the combination works better in all cases. Frequency seems to be the best among the individual criteria.
Results for Criteria Analysis
Webkb
Rep.\Dim.     100      500      1000     2000     5000
fcc mft       0.453    0.472    0.475    0.468    0.475
title         0.432    0.433    0.404    0.488    0.479
emphasis      0.415    0.431    0.433    0.465    0.489
frequency     0.441    0.460    0.460    0.468    0.446
position      0.301    0.283    0.317    0.281    0.286

For Webkb, fcc does not always outperform the others. Frequency is not always the best among the individual criteria. When title and emphasis could lead to a better clustering, the combination gets worse.
Improving the Combination
Frequency should influence the decision more than position.

IF Title    AND Frequency    AND Emphasis    AND Position       THEN Importance
   Low          Medium           Low             Preferential   ⇒    Low
   Low          Medium           Low             Standard       ⇒    No
Extended Fuzzy Combination of Criteria (efcc)
IF Title    AND Frequency    AND Emphasis    AND Position       THEN Importance
   High         -                High            -              ⇒    Very High
   High         -                Medium          Preferential   ⇒    High
   High         -                Medium          Standard       ⇒    Medium
   High         -                Low             Preferential   ⇒    Medium
   High         -                Low             Standard       ⇒    Low
   Low          -                High            Preferential   ⇒    High
   Low          -                High            Standard       ⇒    Medium
   Low          -                Medium          Preferential   ⇒    Medium
   Low          -                Medium          Standard       ⇒    Low
   Low          -                Low             Preferential   ⇒    Low
   Low          -                Low             Standard       ⇒    No
   -            High             -               -              ⇒    Very High
   -            Medium           -               -              ⇒    Medium
   -            Low              -               -              ⇒    No
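One way to read the efcc rule table above is as 14 rules: 11 that combine title, emphasis and position, and 3 that depend on frequency alone. The Python sketch below encodes that reading as plain lookup tables over linguistic labels. It is only an illustration under that assumption: the names are hypothetical, and the membership functions and the way the two rule groups are aggregated are not given in the slides, so they are left out.

```python
# Rules over (title, emphasis, position); None means the position is not used.
STRUCTURE_RULES = {
    ("High", "High", None): "Very High",
    ("High", "Medium", "Preferential"): "High",
    ("High", "Medium", "Standard"): "Medium",
    ("High", "Low", "Preferential"): "Medium",
    ("High", "Low", "Standard"): "Low",
    ("Low", "High", "Preferential"): "High",
    ("Low", "High", "Standard"): "Medium",
    ("Low", "Medium", "Preferential"): "Medium",
    ("Low", "Medium", "Standard"): "Low",
    ("Low", "Low", "Preferential"): "Low",
    ("Low", "Low", "Standard"): "No",
}

# Rules that depend on frequency alone.
FREQUENCY_RULES = {"High": "Very High", "Medium": "Medium", "Low": "No"}


def structure_importance(title, emphasis, position):
    """Look up the importance label for a title/emphasis/position combination."""
    key = (title, emphasis, position)
    if key in STRUCTURE_RULES:
        return STRUCTURE_RULES[key]
    # Fall back to the rule that ignores position (title High, emphasis High).
    return STRUCTURE_RULES[(title, emphasis, None)]


def frequency_importance(frequency):
    """Look up the importance label given by the frequency-only rules."""
    return FREQUENCY_RULES[frequency]


print(structure_importance("High", "High", "Standard"))
print(structure_importance("Low", "Low", "Standard"))
print(frequency_importance("Medium"))
```

Keeping the frequency rules separate means that frequency always has its own say in the result, independently of where a term happens to occur.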
System Comparison
With efcc, both reduction methods get similar results.
efcc solves the problems of fcc in Webkb.

Rep.            Avg.     S.D.
Banksearch
  tf-idf lsi    0.756    0.005
  fcc lsi       0.769    0.011
  efcc mft      0.760    0.014
  efcc lsi      0.758    0.013
Webkb
  tf-idf lsi    0.507    0.006
  fcc mft       0.469    0.009
  efcc mft      0.532    0.032
  efcc lsi      0.483    0.000
Summary
We present a term weighting function based on how humans read documents.
The representation is not tailored to any specific set of web pages.
Nonlinear systems help express relations among criteria.
With a good term weighting function, it is possible to use lightweight dimension reduction techniques.
Our system tries to ease communication between technical and linguistic experts.
Anchor texts were also studied as a way of adding contextual information.
Thank You!