Abstract

As Web search becomes a routine activity in our daily lives, users raise their expectations of search quality, which comprises factors such as accuracy, coverage and usability of the overall system. In this thesis, I describe quantitative strategies for improving search quality from two complementary perspectives, link structure and text structure, which are key topics in the field of Web Information Retrieval. I utilize some fundamental properties of the Web, and presumably of human behavior in general, that are theoretically justified as well as relatively easy to apply.

Link Structure. Humans do not create or follow links to Web pages arbitrarily. In fact, most links point to pages on the same host, a fact that I exploit to simplify the PageRank computation, in particular the computation of the principal eigenvector of the corresponding link matrix. Moreover, humans tend to link to pages that are relevant to the originating document. I present a corresponding method for automatically identifying the topic of a text query solely based on link structure, utilizing multiple topic-specific PageRank vectors.

Text Structure. Humans also do not create or read text on Web pages arbitrarily. I show that the creation process of Web text is governed by a statistical law that corroborates Quantitative Linguistic theory, and I extend current models with the following notion: text on Web pages can be separated into blocks of “short text” and blocks of “long text”, depending on the number of words contained in the block. A large share of the actual “full text” falls into the class of long text, whereas short text appears to mainly cover the navigational text fragments usually referred to as “boilerplate”. I present a simple yet very effective strategy that utilizes this property for accurate main content extraction, ranking and classification.
As an attempt at unification, I conclude that the processes of browsing HTML pages and of creating HTML text can be seen as a combination of two orthogonal motivations. This perspective not only facilitates highly efficient and effective algorithms; it also aids in understanding the corresponding human behavior.
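The distinction between “short text” and “long text” blocks can be illustrated with a minimal Python sketch. It is not the method developed in this thesis: the function names and the threshold of 10 words are assumptions chosen only for illustration, and blocks are assumed to be already available as plain strings.

```python
# Hypothetical threshold: the text only states that the class of a block
# depends on its number of words; the value 10 is an illustrative assumption.
LONG_TEXT_MIN_WORDS = 10


def count_words(block: str) -> int:
    """Count whitespace-separated tokens in a text block."""
    return len(block.split())


def is_long_text(block: str, min_words: int = LONG_TEXT_MIN_WORDS) -> bool:
    """Return True if the block has at least `min_words` words."""
    return count_words(block) >= min_words


def extract_main_content(blocks: list[str],
                         min_words: int = LONG_TEXT_MIN_WORDS) -> list[str]:
    """Keep long-text blocks; drop short blocks as likely boilerplate."""
    return [b for b in blocks if is_long_text(b, min_words)]


if __name__ == "__main__":
    page_blocks = [
        "Home | About | Contact",
        "This is a much longer paragraph of running text that carries "
        "the actual content of the page.",
        "Login",
    ]
    for kept in extract_main_content(page_blocks):
        print(kept)
```

Under this assumed threshold, the navigation bar and the login label are discarded and only the running-text paragraph is kept. A fixed word count is the simplest possible decision rule and would need to be tuned on real data.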



Zusammenfassung (Summary)

With the growing importance of the World Wide Web in daily life, expectations of search engines and their quality are rising as well. This covers aspects such as result accuracy, coverage and usability of the overall system. In this dissertation, I describe quantitative strategies for improving search quality from two complementary perspectives, link structure and text structure, two core topics in the field of Web Information Retrieval. In doing so, I examine and exploit some fundamental properties of the Web (and presumably of human behavior in general) that are theoretically well founded and at the same time relatively easy to apply.

Link Structure. People do not set or follow hyperlinks on Web pages arbitrarily. In fact, a large share of links point to the author's own pages (at host level). I use this property to simplify the PageRank computation, which seeks the principal eigenvector of the associated link matrix. It has also been shown that links are frequently set when the connected pages are topically related. I use this property to automatically find relevant topics for a free-text search query, using only link structure and topic-specific PageRank vectors.

Text Structure. People also do not write or read text on Web pages arbitrarily. I show that the process of creating text on the Web can be described by a statistical text law that is consistent with findings from quantitative-linguistic text theory. However, I extend existing models as follows: text on the Web consists of two kinds of blocks, those with short text and those with long text, depending on the number of words they contain. A large part of the actual main text of a Web page can be described as long text, whereas short text mainly describes the navigation-specific text fragments, the so-called “boilerplate”. I make this textual regularity usable through a simple but effective strategy for accurate text extraction, ranking and classification of Web pages.

As an attempt at unification, I conclude that the processes of creating and consuming HTML links and text can be described as a combination of two orthogonal motivations. This perspective not only enables highly effective algorithms, it also enables a better understanding of human behavior.



Acknowledgments

I am grateful to have had numerous inspiring discussions with many people who shared their insights with me on a variety of topics, which eventually led to the present thesis. I would like to thank them all.

First and foremost, I would like to express my deepest gratitude to my supervisor Professor Dr. Wolfgang Nejdl for giving me the opportunity to conduct my thesis research, and for his advice, guidance and profound support throughout this work. I am also very thankful to Professors Dr.-Ing. Bernardo Wagner and Dr.-Ing. Markus Fidler for serving on the thesis committee and spending their time on my dissertation.

I also owe Professors Dr. Gabriel Altmann and Dr. Reinhard Köhler a special debt of gratitude for providing a plethora of excellent work in the field of Quantitative Linguistics, and for helping me, through their publications as well as private correspondence, to understand the problem domain more deeply from a complementary perspective. Their unparalleled help in introducing me to the Quantitative Linguistics community deserves my deepest respect. Special thanks go to Dr. Peter Fankhauser for drawing my attention to Machine Learning, for deep and insightful comments and for great discussions.

The members, colleagues and former colleagues of the L3S Research Center deserve many thanks for providing a stimulating and fun environment, for fruitful discussions and for interesting collaborations in research and project work. My research was ultimately made possible by a full-time position at L3S as a research associate, which was mainly funded by the European Commission’s FP6/FP7 projects NEPOMUK, ELEONET and SYNC3 and so, indirectly, by the taxpayers. Many thanks also go to the German National Merit Foundation (Studienstiftung des deutschen Volkes) for their conceptual support during my studies.
Finally, I would like to thank my parents for their support and encouragement and for giving me the opportunity, volition and stimulus to seek challenges in academia. Most importantly, I thank my wife, Anastasiya, for her love, exceptional patience and support as well as for her continuous belief in me and my work.

Christian Kohlschütter


Verteilungen der Satzl¨angen (Distribution of Sentence

Lengths). In K.-P. Schulz, editor, Glottometrika 9. Brockmeyer, 1988. [3] Gabriel Altmann. Das Problem der Datenhomogenit¨at. In Glottometrika 13, pages 287–298. Brockmeyer, Bochum, 1992. [4] Gabriel Altmann. Quantitative Linguistics - An International Handbook, chapter Diversification processes. de Gruyter, 2005. [5] Gabriel Altmann and Violetta Burdinski. Towards a Law of Word Repetitions in Text-Blocks. In U. Strauss W. Lehfeldt, editor, Glottometrika 4, volume 14 of Quantitative Linguistics, pages 147–167, Bochum, 1982. Brockmeyer. [6] Aris Anagnostopoulos, Andrei Z. Broder, and David Carmel. Sampling searchengine results. In WWW ’05: Proceedings of the 14th international conference on World Wide Web, pages 245–256, New York, NY, USA, 2005. ACM. ISBN 1-59593-046-9. doi: http://doi.acm.org/10.1145/1060745.1060784. [7] Apostolos Antonacopoulos, Basilios Gatos, and David Bridson. Page segmentation competition. Document Analysis and Recognition, 2007. ICDAR 2007. Ninth International Conference on, 2:1279–1283, 23-26 Sept. 2007. ISSN 15205363. doi: 10.1109/ICDAR.2007.4377121. 123



[8] Arvind Arasu, Jasmine Novak, Andrew Tomkins, and John Tomlin. PageRank Computation and the Structure of the Web: Experiments and Algorithms, 2001. [9] Ricardo A. Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999. ISBN 020139829X. [10] Shumeet Baluja. Browsing on Small Screens: Recasting Web-Page Segmentation into an Efficient Machine Learning Framework. In WWW ’06: Proceedings of the 15th international conference on World Wide Web, pages 33– 42, New York, NY, USA, 2006. ACM.

ISBN 1-59593-323-9.

doi: http:

//doi.acm.org/10.1145/1135777.1135788. [11] Ziv Bar-Yossef and Sridhar Rajagopalan. Template detection via data mining and its applications. In WWW ’02: Proceedings of the 11th international conference on World Wide Web, pages 580–591, New York, NY, USA, 2002. ACM. ISBN 1-58113-449-5. doi: http://doi.acm.org/10.1145/511446.511522. [12] Marco Baroni, Francis Chantree, Adam Kilgarriff, and Serge Sharoff. Cleaneval: a competition for cleaning web pages. In Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odjik, Stelios Piperidis, and Daniel Tapias, editors, Proceedings of the Sixth International Language Resources and Evaluation (LREC’08), 2008. ISBN 2-9517408-4-0. [13] Ori Ben-Yitzhak, Nadav Golbandi, Nadav Har’El, Ronny Lempel, Andreas Neumann, Shila Ofek-Koifman, Dafna Sheinwald, Eugene Shekita, Benjamin Sznajder, and Sivan Yogev. Beyond basic faceted search. In WSDM ’08: Proceedings of the international conference on Web search and web data mining, pages 33–44, New York, NY, USA, 2008. ACM. ISBN 978-1-59593-927-9. doi: http://doi.acm.org/10.1145/1341531.1341539. [14] Tim Berners-Lee and Dan Conolly. RFC 1866: Hypertext Markup Language 2.0, November 1995.



[15] Karl-Heinz Best. Quantitative Linguistics - An International Handbook, chapter Satzl¨ange (Sentence length), pages 298–304. de Gruyter, 2005. [16] Karl-Heinz Best. Sprachliche Einheiten in Textbl¨ocken. In Glottometrics 9, pages 1–12. RAM Verlag, L¨ udenscheid, 2005. [17] Krishna Bharat, Bay-Wei Chang, Monika Rauch Henzinger, and Matthias Ruhl. Who links to whom: Mining linkage between web sites. In Proc. of the IEEE Intl. Conf. on Data Mining, pages 51–58, 2001. ISBN 0-7695-1119-8. [18] Sergey Brin, Rajeev Motwani, Lawrence Page, and Terry Winograd. What can you do with a web in your pocket? Data Engineering Bulletin, 21(2):37–47, 1998. URL citeseer.ist.psu.edu/brin98what.html. [19] Andrei Broder, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar Rajagopalan, Raymie Stata, Andrew Tomkins, and Janet Wiener. Graph structure in the web. In Proc. of the 9th international World Wide Web conference, pages 309–320. North-Holland Publishing Co., 2000. doi: http://dx.doi.org/ 10.1016/S1389-1286(00)00083-9. URL http://www9.org/w9cdrom/160/160. html. [20] Andrei Z. Broder, Ronny Lempel, Farzin Maghoul, and Jan Pedersen. Efficient PageRank Approximation via Graph Aggregation. In Proc. of the 13th International World Wide Web Conference, pages 484–485, 2004. ISBN 1-58113-912-8. [21] Michael K. Buckland. What is a “document”? Journal of the American Society for Information Science, 48:804–809, September 1997. [22] Deng Cai, Shipeng Yu, Ji-Rong Wen, and Wei-Ying Ma. Extracting Content Structure for Web Pages Based on Visual Representation. In X. Zhou, Y. Zhang, and M. E. Orlowska, editors, APWeb, volume 2642 of LNCS, pages 406–417. Springer, 2003. ISBN 3-540-02354-2. [23] Deng Cai, Shipeng Yu, Ji-Rong Wen, and Wei-Ying Ma. Block-based web search. In SIGIR ’04: Proceedings of the 27th annual international ACM SIGIR



conference on Research and development in information retrieval, pages 456– 463, New York, NY, USA, 2004. ACM. ISBN 1-58113-881-4. doi: http://doi. acm.org/10.1145/1008992.1009070. [24] Deepayan Chakrabarti, Ravi Kumar, and Kunal Punera. Page-level Template Detection via Isotonic Smoothing. In WWW ’07: Proc. of the 16th int. conf. on World Wide Web, pages 61–70, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-654-7. doi: http://doi.acm.org/10.1145/1242572.1242582. [25] Deepayan Chakrabarti, Ravi Kumar, and Kunal Punera. A graph-theoretic approach to webpage segmentation. In WWW ’08: Proceeding of the 17th international conference on World Wide Web, pages 377–386, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-085-2. doi: http://doi.acm.org/10.1145/ 1367497.1367549. [26] Ming Chen, Xiaoqing Ding, and Jian Liang. Analysis, understanding and representation of chinese newspaper with complex layout. Image Processing, 2000. Proceedings. 2000 International Conference on, 2:590–593 vol.2, 2000. doi: 10.1109/ICIP.2000.899500. [27] Yen-Yu Chen, Qingqing Gan, and Torsten Suel. I/O-efficient Techniques for Computing PageRank. In CIKM ’02: Proceedings of the eleventh international conference on Information and knowledge management, pages 549–557, New York, NY, USA, 2002. ACM. ISBN 1-58113-492-4. doi: http://doi.acm.org/10. 1145/584792.584882. [28] Yu Chen, Wei-Ying Ma, and Hong-Jiang Zhang. Detecting web page structure for adaptive viewing on small form factor devices. In WWW ’03: Proceedings of the 12th international conference on World Wide Web, pages 225–233, New York, NY, USA, 2003. ACM. ISBN 1-58113-680-3. doi: http://doi.acm.org/10. 1145/775152.775184.



[29] Paul Alexandru Chirita, Wolfgang Nejdl, Raluca Paiu, and Christian Kohlsch¨ utter. Using ODP metadata to personalize search. In SIGIR ’05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 178–185, New York, NY, USA, 2005. ACM. ISBN 1-59593-034-5. doi: http://doi.acm.org/10.1145/1076034. 1076067. [30] Junghoo Cho and Hector Garcia-Molina. The evolution of the web and implications for an incremental crawler. In Proceedings of the 26th International Conference on Very Large Databases, 2000. URL citeseer.ist.psu.edu/ cho00evolution.html. [31] Bruce Croft, Donald Metzler, and Trevor Strohman. Search Engines: Information Retrieval in Practice. Addison-Wesley Publishing Company, USA, 2009. ISBN 0136072240, 9780136072249. [32] Jeffrey Dean and Monika R. Henzinger. Finding related pages in the World Wide Web. Computer Networks (Amsterdam, Netherlands), 31(11–16):1467– 1479, 1999. URL citeseer.ist.psu.edu/dean99finding.html. [33] Sandip Debnath, Prasenjit Mitra, Nirmal Pal, and C. Lee Giles. Automatic identification of informative sections of web pages. IEEE Trans. on Knowledge and Data Engineering, 17(9):1233–1246, 2005. ISSN 1041-4347. doi: http: //doi.ieeecomputersociety.org/10.1109/TKDE.2005.138. [34] Lukasz Debowski. Zipf’s law against the text size: a half-rational model. In Glottometrics 4, pages 49–60. RAM Verlag, L¨ udenscheid, 2002. [35] William Denton.

How to make a faceted classification and put it on the

web. http://www.miskatonic.org/library/facet-web-howto.pdf, November 2003. [36] William Denton. Putting facets on the web: An annotated bibliography. http: //www.miskatonic.org/library/facet-biblio.html, October 2003.



[37] Nadav Eiron, Kevin S. McCurley, and John A. Tomlin. Ranking the web frontier. In WWW ’04: Proceedings of the 13th international conference on World Wide Web, pages 309–318, New York, NY, USA, 2004. ACM. ISBN 1-58113844-X. doi: http://doi.acm.org/10.1145/988672.988714. [38] Mehmet S. Aktas et al. Personalizing pagerank based on domain profiles. In WEBKDD’04, Seattle, USA, pages 83–90, August 2004. URL citeseer.ist. psu.edu/708503.html. [39] Pavel Calado et al. Link-based similarity measures for the classification of web documents. J. Am. Soc. Inf. Sci. Technol., 57(2):208–221, 2006. ISSN 15322882. doi: http://dx.doi.org/10.1002/asi.v57:2. [40] Soumen Chakrabarti et al. Enhanced hypertext categorization using hyperlinks. In SIGMOD ’98, pages 307–318, New York, NY, US, 1998. ACM Press. ISBN 0-89791-995-5. doi: http://doi.acm.org/10.1145/276304.276332. [41] Stefan Evert. A lightweight and efficient tool for cleaning web pages. In Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odjik, Stelios Piperidis, and Daniel Tapias, editors, Proceedings of the Sixth International Language Resources and Evaluation (LREC’08), Marrakech, Morocco, may 2008. European Language Resources Association (ELRA). ISBN 2-9517408-4-0. http://www.lrec-conf.org/proceedings/lrec2008/. [42] Fariza Fauzi, Jer-Lang Hong, and Mohammed Belkhatir. Webpage segmentation for extracting images and their surrounding contextual information. In MM ’09: Proceedings of the seventeen ACM international conference on Multimedia, pages 649–652, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-608-3. doi: http://doi.acm.org/10.1145/1631272.1631379. [43] David Fernandes, Edleno S. de Moura, Berthier Ribeiro-Neto, Altigran S. da Silva, and Marcos Andr´e Gon¸calves. Computing block importance for searching on web sites. In CIKM ’07, pages 165–174, 2007. ISBN 978-1-59593-803-9. doi: http://doi.acm.org/10.1145/1321440.1321466.



[44] Adriano Ferraresi, Eros Zanchetta, Marco Baroni, and Silvia Bernardini. Introducing and evaluating ukWaC, a very large Web-derived corpus of English. In Proceedings of the WAC4 Workshop at LREC 2008. [45] Aidan Finn, Nicholas Kushmerick, and Barry Smyth. Fact or fiction: Content classification for digital libraries. Joint DELOS-NSF Workshop on Personalisation and Recommender Systems in Digital Libraries (Dublin), 2001. [46] Santo Fortunato, Mari´an Bogu n´a, Alessandro Flammini, and Filippo Menczer. Approximating pagerank from in-degree. pages 59–71, 2008. doi: http://dx. doi.org/10.1007/978-3-540-78808-9 6. [47] David Gibson, Kunal Punera, and Andrew Tomkins. The volume and evolution of web page templates. In WWW ’05: Special interest tracks and posters of the 14th international conference on World Wide Web, pages 830–839, New York, NY, USA, 2005. ACM. ISBN 1-59593-051-5. doi: http://doi.acm.org/10.1145/ 1062745.1062763. [48] John Gibson, Ben Wellner, and Susan Lubar. Adaptive web-page content identification. In WIDM ’07: Proceedings of the 9th annual ACM international workshop on Web information and data management, pages 105–112, New York, NY, USA, 2007. ACM.

ISBN 978-1-59593-829-9.

doi: http:

//doi.acm.org/10.1145/1316902.1316920. [49] David Gleich, Leonid Zhukov, and Pavel Berkhin. Fast parallel PageRank: A linear system approach. Technical report, Yahoo! Research Labs, 2004. URL http://research.yahoo.com/publications/38.pdf. [50] Scott Golder and Bernardo A. Huberman. rative tagging systems. HP Labs, 2005.

The structure of collabo-

Technical report, Information Dynamics Lab,

URL http://www.isrl.uiuc.edu/∼amag/langev/paper/

golder05taggingSystems.html. [51] Peter Grzybek, editor. Contributions to the Science of Text and Language. Springer, 2006.



[52] Peter Grzybek. On the systematic and system-based study of grapheme frequencies - a re-analysis of german letter frequencies. In G. Altmann, K.-H. Best, and P. Grzybek et al., editors, Glottometrics 15, pages 82–91. RAM Verlag, L¨ udenscheid, 2007. [53] Taher H. Haveliwala. Efficient computation of PageRank. Technical Report 1999-31, Stanford Library Technologies Project, 1999. URL citeseer.ist. psu.edu/haveliwala99efficient.html. [54] Taher H. Haveliwala. Topic-sensitive PageRank. In Proc. of the eleventh International Conference on World Wide Web, pages 517–526. ACM Press, 2002. ISBN 1-58113-449-5. doi: http://doi.acm.org/10.1145/511446.511513. [55] Taher H. Haveliwala et al. 2001 Crawl of the WebBase project, 2001. URL http://dbpubs.stanford.edu:8091/∼testbed/doc2/WebBase/. [56] Marti A. Hearst. Multi-paragraph segmentation of expository text. In Proceedings of the 32nd annual meeting on Association for Computational Linguistics, pages 9–16, Morristown, NJ, USA, 1994. Association for Computational Linguistics. doi: http://dx.doi.org/10.3115/981732.981734. [57] Marti A. Hearst. Clustering versus faceted categories for information exploration. Commun. ACM, 49(4):59–61, 2006. [58] Marti A. Hearst, Ame Elliott, Jennifer English, Rashmi R. Sinha, Kirsten Swearingen, and Ka-Ping Yee. Finding the flow in web site search. Commun. of the ACM, 45(9):42–49, 2002. [59] James Hendler, Nigel Shadbolt, Wendy Hall, Tim Berners-Lee, and Daniel Weitzner. Web science: an interdisciplinary approach to understanding the web. Commun. ACM, 51(7):60–69, 2008. ISSN 0001-0782. doi: http://doi.acm. org/10.1145/1364782.1364798. [60] Katja Hofmann and Wouter Weerkamp. Web Corpus Cleaning using Content and Structure. In Building and Exploring Web Corpora, pages 145–154. UCL Presses Universitaires de Louvain, September 2007.



[61] Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of Classification, 2(1):193–218, December 1985. URL http://ideas.repec.org/a/ spr/jclass/v2y1985i1p193-218.html. [62] Glen Jeh and Jennifer Widom. Scaling personalized web search. In WWW ’03, pages 271–279, New York, NY, USA, 2003. ISBN 1-58113-680-3. doi: http://doi.acm.org/10.1145/775152.775191. [63] Maryam Kamvar, Melanie Kellar, Rajan Patel, and Ya Xu. Computers and iphones and mobile phones, oh my!: a logs-based comparison of search users on different devices. In WWW ’09: Proceedings of the 18th international conference on World wide web, pages 801–810, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-487-4. doi: http://doi.acm.org/10.1145/1526709.1526817. [64] Sepandar Kamvar, Taher Haveliwala, Christopher Manning, and Gene Golub. Exploiting the block structure of the web for computing PageRank. Technical report, Stanford University, 2003. URL citeseer.ist.psu.edu/article/ kamvar03exploiting.html. [65] Sepandar D. Kamvar, Taher H. Haveliwala, and Gene H. Golub. Adaptive methods for the computation of PageRank. Technical report, Stanford University, 2003. URL citeseer.ist.psu.edu/kamvar03adaptive.html. [66] Sepandar D. Kamvar, Taher H. Haveliwala, Christopher D. Manning, and Gene H. Golub. Extrapolation methods for accelerating PageRank computations. In Proc. of the 12th Intl. Conf. on the World Wide Web, pages 261–270, 2003. ISBN 1-58113-680-3. [67] Hung-Yu Kao, Jan-Ming Ho, and Ming-Syan Chen. Wisdom: Web intrapage informative structure mining based on document object model. Knowledge and Data Engineering, IEEE Transactions on, 17(5):614–627, May 2005. ISSN 1041-4347. doi: 10.1109/TKDE.2005.84. [68] Maurice G. Kendall. Rank Correlation Methods. Hafner, New York, USA, 1955.



[69] Sung Jin Kim and Sang Ho Lee. An improved computation of the PageRank algorithm. In Proc. of the European Conference on Information Retrieval (ECIR), pages 73–85, 2002. URL citeseer.ist.psu.edu/kim02improved.html. [70] David P. Koester, Sanjay Ranka, and Geoffrey C. Fox. A parallel gauss-seidel algorithm for sparse power system matrices. In Supercomputing ’94: Proceedings of the 1994 ACM/IEEE conference on Supercomputing, pages 184–193, New York, NY, USA, 1994. ACM. ISBN 0-8186-6605-6. doi: http://doi.acm.org/10. 1145/602770.602806. [71] Reinhard K¨ohler. Elemente der synergetischen Linguistik. In Glottometrika 12, pages 179–187. Brockmeyer, Bochum, 1990. [72] Reinhard K¨ohler. Synergetic linguistics. In Quantitative Linguistics – An International Handbook, pages 760–774. de Gruyter, 2005. [73] Christian Kohlsch¨ utter. A Densitometric Classification of Web Template Content. In Emmerich Kelih, Viktor Levickij, and Gabriel Altmann, editors, Methods of Text Analysis: Omnibus volume, pages 133–155. Chernivtsi: CNU, 2009. [74] Christian Kohlsch¨ utter. A Densitometric Analysis of Web Template Content. In WWW ’09: Proceedings of the 18th International World Wide Web Conference, pages 1165–1166, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-487-4. doi: http://doi.acm.org/10.1145/1526709.1526909. [75] Christian Kohlsch¨ utter and Wolfgang Nejdl. A Densitometric Approach to Web Page Segmentation. In CIKM ’08: Proceedings of the 17th ACM conference on Information and Knowledge Management, pages 1173–1182, New York, NY, USA, 2008. ACM. ISBN 978-1-59593-991-3. doi: http://doi.acm.org/10.1145/ 1458082.1458237. [76] Christian Kohlsch¨ utter, Paul-Alexandru Chirita, and Wolfgang Nejdl. Using Link Analysis to Identify Aspects in Faceted Web Search. In SIGIR’2006 Workshop on Faceted Search, Seattle, WA, USA, August 2006.



[77] Christian Kohlsch¨ utter, Paul-Alexandru Chirita, and Wolfgang Nejdl. Efficient parallel computation of pagerank. In ECIR 2006: Advances in Information Retrieval 2006: 28th European Conference on IR Research, volume LNCS 3936, London, UK, April 2006. [78] Christian Kohlsch¨ utter, Paul-Alexandru Chirita, and Wolfgang Nejdl. Utility analysis for topically biased PageRank. In WWW ’07: Proceedings of the 16th international conference on World Wide Web, pages 1211–1212, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-654-7. doi: http://doi.acm.org/10.1145/ 1242572.1242770. [79] Christian Kohlsch¨ utter, Peter Fankhauser, and Wolfgang Nejdl. Boilerplate detection using shallow text features. In WSDM ’10: Proceedings of the third ACM International Conference on Web search and Data Mining, pages 441– 450, New York, NY, USA, 2010. ACM. ISBN 978-1-60558-889-6. doi: http: //doi.acm.org/10.1145/1718487.1718542. [80] Amy N. Langville and Carl D. Meyer. Deeper inside PageRank. Internet Mathematics, 1(3):335–380, 2004. [81] Daniel Lavalette. A general purpose ranking variable with applications to various ranking laws. In Peter Grzybek and Reinhard K¨ohler, editors, Exact Methods in the Study of Language and Text, pages 371–382. de Gruyter, 2007. [82] Chris P. Lee, Gene H. Golub, and Stefanos A. Zenios. A fast two-stage algorithm for computing PageRank. Technical report, Stanford University, 2003. [83] Sonya Liberman and Ronny Lempel. Approximately optimal facet selection. In The 4th Workshop on the Future of Web Search, April 2009. URL http: //research.yahoo.com/files/facets_ysite.pdf. [84] Dmitry Lizorkin, Pavel Velikhov, Maxim Grinev, and Denis Turdakov. Accuracy estimate and optimization techniques for simrank computation. The VLDB Journal, 19(1):45–66, 2010. ISSN 1066-8888. doi: http://dx.doi.org/10.1007/ s00778-009-0168-8.



[85] Qing Lu and Lise Getoor. Link-based text classification. Text-Mining & LinkAnalysis Workshop TextLink 2003, 2003. [86] Bundit Manaskasemsak and Arnon Rungsawang. Parallel PageRank computation on a gigabit pc cluster. In Proc. of the 18th International Conference on Advanced Information Networking and Application (AINA’04), 2004. [87] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Sch¨ utze. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008. ISBN 0521865719, 9780521865715. [88] Adam Mathes. Folksonomies - cooperative classification and communication through shared metadata. Computer Mediated Communication - LIS590CMC, Graduate School of Library and Information Science, University of Illinois Urbana-Champagin, December 2004. [89] Frank McSherry. A Uniform Approach to Accelerated PageRank Computation. In Proceedings of the 14th international World Wide Web Conference, pages 575–582, New York, NY, USA, 2005. ACM Press. ISBN 1-59593-046-9. doi: http://doi.acm.org/10.1145/1060745.1060829. [90] Sundaresan Naranan and Viddhachalam K. Balasubrahmanyan. Power laws in statistical linguistics and related systems. In Quantitative Linguistics – An International Handbook, pages 716–738. de Gruyter, 2005. [91] Iadh Ounis, Craig Macdonald, Maarten de Rijke, Gilad Mishne, and Ian Soboroff. Overview of the trec 2006 blog track. In Ellen M. Voorhees and Lori P. Buckland, editors, TREC, volume Special Publication 500-272. National Institute of Standards and Technology (NIST), 2006. [92] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd.


PageRank citation ranking: Bringing order to the web. Technical report, Stanford Digital Library Technologies Project, 1998. URL citeseer.ist.psu.edu/ page98pagerank.html.



[93] Jeff Pasternack and Dan Roth. Extracting article text from the web with maximum subsequence segmentation. In WWW ’09: Proceedings of the 18th international conference on World wide web, pages 971–980, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-487-4. doi: http://doi.acm.org/10.1145/ 1526709.1526840. [94] Ioan-Iovitz Popescu. On a Zipf’s Law Extension to Impact Factors. In Glottometrics 6. RAM Verlag, L¨ udenscheid, 2003. [95] Ioan-Iovitz Popescu and Gabriel Altmann. Some aspects of word frequencies. In Glottometrics 13, pages 23–46. RAM Verlag, L¨ udenscheid, 2006. [96] Feng Qiu and Junghoo Cho. Automatic identification of user interest for personalized search. In Proc. of the 15th international World Wide Web conference, 2006. [97] Davood Rafiei, Krishna Bharat, and Anand Shukla. Diversifying web search results. In WWW ’10: Proceedings of the 19th international conference on World wide web, pages 781–790, New York, NY, USA, 2010. ACM. ISBN 9781-60558-799-8. doi: http://doi.acm.org/10.1145/1772690.1772770. [98] Stephen Robertson. Understanding inverse document frequency: On theoretical arguments for idf. Journal of Documentation, 60:2004, 2004. [99] Stephen Robertson, Steve Walker, Susan Jones, Micheline M. HancockBeaulieu, and Mike Gatford. Okapi at trec-3. pages 109–126, 1996. [100] Gerard Salton and Michael J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, Inc., 1986. ISBN 0070544840. [101] Gerard Salton, Anita Wong, and Chung-Shu Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11):613–620, 1975. ISSN 0001-0782. doi: http://doi.acm.org/10.1145/361219.361220.



[102] Gerard Salton, Edward A. Fox, and Harry Wu. Extended boolean information retrieval. Commun. ACM, 26(11):1022–1036, 1983. ISSN 0001-0782. doi: http: //doi.acm.org/10.1145/182.358466. [103] Karthikeyan Sankaralingam, Simha Sethumadhavan, and James C. Browne. Distributed Pagerank for P2P Systems. In Proc. of the 12th IEEE Intl. Symp. on High Performance Distributed Computing (HPDC), page 58, 2003. ISBN 0-7695-1965-2. [104] Tamas Sarlos, Andras A. Benczur, Karoly Csalogany, Daniel Fogaras, and Balazs Racz. To randomize or not to randomize: Space optimal summaries for hyperlink analysis. In Proc. of the 15th international World Wide Web conference, 2006. [105] Claude E. Shannon. A Mathematical Theory of Communication. The Bell System Technical Journal, 27:379–423, 623–656, October 1948. URL http: //cm.bell-labs.com/cm/ms/what/shannonday/paper.html. [106] Shu-Ming Shi, Jin Yu, Guang-Wen Yang, and Ding-Xing Wang. Distributed Page Ranking in Structured P2P Networks. In Proc. of the 2003 International Conference on Parallel Processing (ICPP’03), pages 179–186, 2003. [107] Amit Singhal and Marcin Kaszkiel. A case study in web search using TREC algorithms. In WWW ’01: Proceedings of the 10th international conference on World Wide Web, pages 708–716, New York, NY, USA, 2001. ACM. ISBN 1-58113-348-0. doi: http://doi.acm.org/10.1145/371920.372186. [108] Jared M. Spool, Tara Scanlon, Carolyn Snyder, Will Schroeder, and Terri DeAngelo. Web site usability: a designer’s guide. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1999. ISBN 1-55860-569-X. [109] Miroslav Spousta, Michael Marek, and Pavel Pecina. Victor: the web-page cleaning tool. In WaC4, 2008.



[110] George Stockman and Linda G. Shapiro. Computer Vision. Prentice Hall PTR, Upper Saddle River, NJ, USA, 2001. ISBN 0130307963. [111] Alexander Strehl and Joydeep Ghosh. Cluster ensembles — a knowledge reuse framework for combining multiple partitions. J. Mach. Learn. Res., 3:583–617, 2003. ISSN 1533-7928. [112] Juhan Tuldava. Stylistics, author identification. In Quantitative Linguistics – An International Handbook, pages 368–387. de Gruyter, 2005. [113] Fiona J. Tweedie.

Statistical models in stylistics and forensic linguistics.

In Quantitative Linguistics – An International Handbook, pages 387–397. de Gruyter, 2005. [114] Karane Vieira, Altigran S. da Silva, Nick Pinto, Edleno S. de Moura, Joao M. B. Cavalcanti, and Juliana Freire. A fast and robust method for web page template detection and removal. In CIKM ’06: Proceedings of the 15th ACM International Conference on Information and Knowledge Management, pages 258–267, 2006. ISBN 1-59593-433-2. doi: http://doi.acm.org/10.1145/1183614. 1183654. [115] Relja Vulanovic and Reinhard K¨ohler. Quantitative Linguistics - An international Handbook, chapter Syntactic units and structures, pages 274–291. de Gruyter, 2005. [116] Yuan Wang and David J. DeWitt. Computing PageRank in a distributed internet search system. In Proceedings of the 30th VLDB Conference, 2004. [117] Gejza Wimmer and Gabriel Altmann. Thesaurus of univariate discrete probability distributions. Stamm Verlag, 1999. [118] Gejza Wimmer and Gabriel Altmann. Unified derivation of some linguistic laws. In Quantitative Linguistics – An International Handbook, pages 791–807. de Gruyter, 2005.



[119] Alex Wright. Ready for a web os? Commun. ACM, 52(12):16–17, 2009. ISSN 0001-0782. doi: http://doi.acm.org/10.1145/1610252.1610260. [120] Jie Wu and Karl Aberer. Using SiteRank for P2P Web Retrieval, March 2004. URL citeseer.ist.psu.edu/wu04using.html. [121] Yiming Yang, Sean Slattery, and Rayid Ghani. A study of approaches to hypertext categorization. Journal of Intelligent Information Systems, 18(2-3): 219–241, 2002. URL citeseer.ist.psu.edu/478602.html. [122] Ka-Ping Yee, Kirsten Swearingen, Kevin Li, and Marti A. Hearst. Faceted metadata for image search and browsing. In Proc. of the SIGCHI Conf. on Human Factors in Computing Systems, pages 401–408, 2003. [123] Lan Yi, Bing Liu, and Xiaoli Li. Eliminating noisy information in web pages for data mining. In KDD ’03: Proc. of the 9th ACM SIGKDD int. conf. on Knowledge discovery and data mining, pages 296–305, 2003. ISBN 1-58113-7370. doi: http://doi.acm.org/10.1145/956750.956785. [124] Yangbo Zhu, Shaozhi Ye, and Xing Li. Distributed pagerank computation based on iterative aggregation-disaggregation methods. In Proc. of the 14th ACM international conference on Information and knowledge management, 2005. [125] George K. Zipf. Human Behavior and the Principle of Least Effort. AddisonWesley, Reading, 1949.