Blog, Forum or Newspaper? Web Genre Detection using SVMs
Philipp Berger, Patrick Hennig, Martin Schönberg, and Christoph Meinel
Hasso Plattner Institute
06.12.2015
Project Scope
■ Prototype called Blog Intelligence ■ Several million blog posts ■ Personalized Analysis of the whole corpus ■ Meaningful Visualizations ■ Smart Search Engine ■ blog-intelligence.com
[email protected]
Blog, Forum or Newspaper?
■ Web genre detection task ■ Blogs have become a versatile publishing platform □ Personal blogs □ Community blogs □ Forums □ News portals □ Homepages □ Shopping portals □…
Amazing Variety Corporate Blogs vs. Blog Communities
Amazing Variety Fashion Blog vs. News Portal
Amazing Variety News Communities vs. Travel Blogs vs. Cooking Blogs
Definition is Key - Our Perspective ■ What is … □ … a blog? – Diary-like website by one or more authors – Links between blogs create a social graph, the blogosphere
□ … a forum? – Multiple users discuss on various threads – High diversity of authors
□ … a news portal? – High number of authors / one pseudonym – High number of articles – More objective
Related Work Automatic Website Categorization ■ "SVMs for the Blogosphere: Blog Identification and Splog Detection" ■ Pranam Kolari, Akshay Java, Tim Finin, Tim Oates, Anupam Joshi, University of Maryland, Baltimore County, 2006 ■ “Is this a blog or not?” ■ ~97% F-Measure ■ 2600 manually annotated blogs ■ Support Vector Machines ■ Cross Validation ■ 9 years ago
General Approach ML-supported Classification (SVM)
Feature Pool Ngrams
■ Extract the most significant ngrams (size 4-6) ■ Restrict to text areas □ content □ links □ page url □ title □ anchor tags
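The ngram extraction described above might look roughly like the following sketch. It assumes character ngrams and uses raw frequency as a stand-in for "most significant"; the slides specify neither choice, and the function name `top_ngrams` and the sample inputs are hypothetical.

```python
from collections import Counter


def top_ngrams(text, min_n=4, max_n=6, k=5):
    """Return the k most frequent character ngrams of length min_n..max_n.

    Assumption: raw frequency stands in for "significance".
    """
    text = text.lower()
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts.most_common(k)


# One ngram list per text area (content, links, page url, title, anchor tags).
# Only two areas with made-up values are shown here.
areas = {
    "title": "My travel blog",
    "url": "http://example.com/blog/2015/12/travel",
}
features = {area: top_ngrams(text) for area, text in areas.items()}
```

Keeping the ngrams separate per text area lets a classifier weight, say, URL ngrams differently from content ngrams.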
Feature Pool Human curated keywords □ Title and URL
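As an illustration of keyword features on title and URL, here is a minimal sketch. The keyword lists below are invented for the example; the curated lists used in the talk are not shown on the slides.

```python
# Hypothetical keyword lists, one per genre (not the authors' lists).
KEYWORDS = {
    "blog": ["blog", "diary"],
    "forum": ["forum", "board", "thread"],
    "news": ["news", "daily", "times"],
}


def keyword_features(title, url):
    """Return one binary feature per genre: 1 if any keyword occurs."""
    haystack = f"{title} {url}".lower()
    return {
        genre: int(any(word in haystack for word in words))
        for genre, words in KEYWORDS.items()
    }
```

For example, `keyword_features("Support Forum", "http://example.com/board")` sets only the `forum` feature.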
Feature Pool
Website clues and ratios ■ Social links to □ Facebook, GooglePlus, Twitter, Youtube ■ Search boxes ■ Text areas and submit buttons (comment section) ■ #Links to length of text ■ #Images to length of text ■ #Videos to length of text
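The ratio features listed above could be computed along these lines. This is a sketch using Python's standard `html.parser`; the class name `ClueCounter`, the feature names, and the choice to measure text length in characters are assumptions, not details from the talk.

```python
from html.parser import HTMLParser


class ClueCounter(HTMLParser):
    """Counts links, images, videos, text areas, and text length."""

    def __init__(self):
        super().__init__()
        self.links = self.images = self.videos = self.text_length = 0
        self.has_textarea = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += 1
        elif tag == "img":
            self.images += 1
        elif tag == "video":
            self.videos += 1
        elif tag == "textarea":
            self.has_textarea = True  # clue for a comment section

    def handle_data(self, data):
        # Simplification: script and style contents are counted as text too.
        self.text_length += len(data.strip())


def ratio_features(html):
    """Return element counts relative to the length of the page text."""
    counter = ClueCounter()
    counter.feed(html)
    counter.close()
    length = max(counter.text_length, 1)  # avoid division by zero
    return {
        "links_per_char": counter.links / length,
        "images_per_char": counter.images / length,
        "videos_per_char": counter.videos / length,
        "has_textarea": counter.has_textarea,
    }
```

Normalizing by text length keeps a long article with many links comparable to a short page with few.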
Problem of SVM Evaluation Many Parameters and Folds… ■ 22,040 parameter combinations
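To see why the number of training runs grows so quickly, consider the sketch below. The search space and fold count are hypothetical; the slides give only the total number of combinations, not the grid itself.

```python
from itertools import product

# Hypothetical search space (not the grid used in the talk).
grid = {
    "kernel": ["linear", "rbf"],
    "C": [0.1, 1, 10, 100],
    "gamma": [0.001, 0.01, 0.1],
}
folds = 10  # assumed number of cross-validation folds

# Every combination of parameter values is one candidate configuration.
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]

# Each configuration is trained once per fold.
training_runs = len(combinations) * folds
```

Even this small grid yields 24 configurations and 240 training runs; adding feature subsets as another dimension multiplies the count again.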
Evaluation
Spinn3r Dataset
Evaluation Single Features
■ Best Single Features □ Content ngrams alone already reach 78% accuracy □ Manual keywords alone reach only 50% accuracy
Evaluation
Combinations
■ Ngram-based classification alone already reaches an overall accuracy of 78% ■ Highest accuracy: 83.5% over all categories □ Best features – Ngrams of content – Ngrams of URLs – Quantities (number of links, images, …)
□ Best feature set uses no manual keywords and no ratios
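For reference, overall accuracy over all categories is the share of correctly classified sites across the whole test set. The sketch below also reports accuracy per category; the function name `accuracy_report` is hypothetical, and it is not the authors' evaluation code.

```python
from collections import defaultdict


def accuracy_report(true_labels, predicted_labels):
    """Return (overall accuracy, accuracy per true category)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, guess in zip(true_labels, predicted_labels):
        total[truth] += 1
        correct[truth] += int(truth == guess)
    overall = sum(correct.values()) / sum(total.values())
    per_category = {label: correct[label] / total[label] for label in total}
    return overall, per_category
```

Reporting both numbers helps, because a high overall accuracy can hide a weak category when the classes are unevenly sized.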
Conclusion
■ Definition of categories for blog-like websites ■ Implemented various features ■ Extensive evaluation ■ 83.5% overall accuracy
Thank you for your attention! Contact me:
[email protected]