Blog Detection

Jun 12, 2015 - Blog, Forum or Newspaper? Web Genre Detection using SVMs. Philipp Berger, Patrick Hennig, Martin Schönberg, and Christoph Meinel.
Blog, Forum or Newspaper? Web Genre Detection using SVMs Philipp Berger, Patrick Hennig, Martin Schönberg, and Christoph Meinel Hasso Plattner Institute 06.12.2015

Project Scope

■ Prototype called Blog Intelligence
■ Several million blog posts
■ Personalized Analysis of the whole corpus
■ Meaningful Visualizations
■ Smart Search Engine
■ blog-intelligence.com

[email protected]

Blog, Forum or Newspaper?

■ Web genre detection task
■ Blogs have become a versatile publishing platform
  □ Personal blogs
  □ Community blogs
  □ Forums
  □ News portals
  □ Homepages
  □ Shopping Portals
  □ …


Amazing Variety Corporate Blogs vs. Blog Communities


Amazing Variety Fashion Blog vs. News Portal


Amazing Variety News Communities vs. Travel Blogs vs. Cooking Blogs


Definition is Key - Our Perspective
■ What is …
  □ … a blog?
    – Diary-like website by one or more authors
    – Creates a social graph through linking, called the blogosphere
  □ … a forum?
    – Multiple users discuss in various threads
    – High diversity of authors
  □ … a news portal?
    – High number of authors / one pseudonym
    – High number of articles
    – More objective


Related Work
Automatic Website Categorization
■ SVMs for the Blogosphere: Blog Identification and Splog Detection
■ Pranam Kolari, Akshay Java, Tim Finin, Tim Oates, Anupam Joshi, University of Maryland Baltimore County, 2006
■ “Is this a blog or not?”
■ ~97% F-Measure
  ● 2600 manually annotated blogs
  ● Support Vector Machines
  ● Cross Validation
  ● 9 years ago


General Approach ML-supported Classification (SVM)


Feature Pool Ngrams

■ Extract the most significant ngrams (size 4-6)
■ Restrict to text areas
  □ content
  □ links
  □ page url
  □ title
  □ anchor tags
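
The ngram features listed above could be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the slides do not say whether the ngrams are character or word based, nor how "most significant" is measured, so the sketch assumes character ngrams and plain frequency ranking. The names `char_ngrams`, `top_ngrams`, `ngram_vector` and the sample pages are hypothetical.

```python
from collections import Counter


def char_ngrams(text, min_n=4, max_n=6):
    """Yield every character n-gram of length min_n..max_n from text."""
    # Assumption: lower-casing and whitespace normalization before extraction.
    text = " ".join(text.lower().split())
    for n in range(min_n, max_n + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


def top_ngrams(pages, area, k=1000):
    """Return the k most frequent n-grams of one text area across all pages.

    Assumption: "most significant" is approximated by raw frequency.
    """
    counts = Counter()
    for page in pages:
        counts.update(char_ngrams(page.get(area, "")))
    return [gram for gram, _ in counts.most_common(k)]


def ngram_vector(page, area, vocabulary):
    """Count how often each vocabulary n-gram occurs in one page's text area."""
    counts = Counter(char_ngrams(page.get(area, "")))
    return [counts[gram] for gram in vocabulary]


# Hypothetical sample pages, each split into text areas (here: url and title).
pages = [
    {"url": "example.com/blog/2015/06/my-trip", "title": "My trip to Rome"},
    {"url": "example.com/forum/thread/42", "title": "Forum: best pasta?"},
]
vocab = top_ngrams(pages, "url", k=50)
vector = ngram_vector(pages[0], "url", vocab)
```

Building one vocabulary per text area keeps, for example, URL ngrams and content ngrams as separate feature groups, which matches the per-area evaluation reported later in the slides.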


Feature Pool Human-curated keywords
□ Title and URL


Feature Pool

Website clues and ratios
■ Social links to
  □ Facebook, GooglePlus, Twitter, Youtube
■ Search boxes
■ Text areas and submit buttons (comment section)
■ #Links to length of text
■ #Images to length of text
■ #Videos to length of text
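
One possible way to compute such clues and ratios from raw HTML is sketched below with Python's standard `html.parser`. The host list, the choice to count only `<video>` tags as videos, the `type="search"` test for search boxes, and the names `ClueParser` and `clue_features` are assumptions made for illustration; the slides do not specify how the features were extracted.

```python
from html.parser import HTMLParser

# Assumption: social links are recognized by these host substrings.
SOCIAL_HOSTS = ("facebook.com", "plus.google.com", "twitter.com", "youtube.com")


class ClueParser(HTMLParser):
    """Collect simple structural counts from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = 0
        self.images = 0
        self.videos = 0
        self.social_links = 0
        self.textareas = 0
        self.has_search_box = False
        self.text_length = 0
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            self.links += 1
            href = (attrs.get("href") or "").lower()
            if any(host in href for host in SOCIAL_HOSTS):
                self.social_links += 1
        elif tag == "img":
            self.images += 1
        elif tag == "video":
            self.videos += 1
        elif tag == "textarea":
            self.textareas += 1
        elif tag == "input" and (attrs.get("type") or "").lower() == "search":
            self.has_search_box = True

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Only visible text counts towards the text length.
        if not self._skip:
            self.text_length += len(data.strip())


def clue_features(html):
    """Return counts and count-to-text-length ratios for one page."""
    parser = ClueParser()
    parser.feed(html)
    parser.close()
    length = max(parser.text_length, 1)  # avoid division by zero
    return {
        "social_links": parser.social_links,
        "has_search_box": parser.has_search_box,
        "textareas": parser.textareas,
        "links_per_char": parser.links / length,
        "images_per_char": parser.images / length,
        "videos_per_char": parser.videos / length,
    }


sample = (
    "<html><body><p>Hello world, this is a post.</p>"
    '<a href="https://twitter.com/someone">t</a>'
    '<a href="/about">about</a><img src="a.png">'
    '<form><input type="search"><textarea></textarea></form>'
    "</body></html>"
)
features = clue_features(sample)
```

Normalizing the counts by text length makes a short forum thread and a long news article comparable, which is presumably why the slide lists ratios and not only raw counts.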


Problem of SVM Evaluation
Many Parameters and Folds…
■ 22040 parameter combinations
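
To illustrate why the number of training runs grows so quickly, the following sketch enumerates a small parameter grid and multiplies it by the number of cross-validation folds. The grid values, the fold count, and the helper names are hypothetical and chosen only for illustration; they are not the grid behind the 22040 combinations mentioned above.

```python
from itertools import product

# Hypothetical SVM parameter grid; the actual grid is not given in the slides.
grid = {
    "kernel": ["linear", "rbf"],
    "C": [0.1, 1, 10, 100],
    "gamma": [0.001, 0.01, 0.1],
}
folds = 10  # assumed number of cross-validation folds


def parameter_combinations(grid):
    """Yield one dict per combination of parameter values."""
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))


def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for a simple interleaved k-fold split."""
    for fold in range(k):
        test = [i for i in range(n_samples) if i % k == fold]
        train = [i for i in range(n_samples) if i % k != fold]
        yield train, test


n_combinations = sum(1 for _ in parameter_combinations(grid))
# Every combination is trained and tested once per fold.
training_runs = n_combinations * folds
print(n_combinations, training_runs)
```

Each additional parameter multiplies the number of combinations, and each combination is evaluated once per fold, so even modest grids lead to thousands of SVM training runs.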


Evaluation

Spinn3r Dataset


Evaluation Single Features



■ Best Single Features
  □ Ngrams of content alone already reach an accuracy of 78%
  □ Manual keywords alone reach only 50% accuracy


Evaluation

Combinations



■ Ngram-based classification already has an overall accuracy of 78%
■ Highest accuracy: 83.5% over all categories
  □ Best features
    – Ngrams of content
    – Ngrams of URLs
    – Quantities (number of links, images, …)
  □ No manual keywords or ratios

Conclusion

■ Definition of categories for blog-like websites
■ Implemented various features
■ Extensive evaluation
■ 83.5% overall accuracy


Thank you for your attention! Contact me:

[email protected]