Automatic Web pages author extraction

Sahar Changuel, Nicolas Labroche, and Bernadette Bouchon-Meunier
Laboratoire d'Informatique de Paris 6 (LIP6)
DAPA, LIP6, 104, Avenue du Président Kennedy, 75016, Paris, France
{Sahar.Changuel, Nicolas.Labroche, Bernadette.Bouchon-Meunier}@lip6.fr
Abstract. This paper addresses the problem of automatically extracting the author from heterogeneous HTML resources as a subproblem of automatic metadata extraction from (Web) documents. We take a supervised machine learning approach to the problem using the C4.5 Decision Tree algorithm. The particularity of our approach is that it exploits both structural and contextual information. A semi-automatic approach was adopted to expand the corpus, allowing the dataset to be annotated with less human effort. This paper shows that our method achieves good results (more than 80% in terms of F1-measure) despite the heterogeneity of our corpus.
1 Introduction
The Web has become the major source of information, disseminating news and documents at an incredible speed. With this rapid increase of information, locating relevant resources is becoming more and more difficult. One approach to making the Web more understandable to machines is the Semantic Web [1], where resources are enriched with descriptive information called metadata. Metadata are commonly known as structured data about data that can describe the contents, semantics and services of data, playing a central role in supporting resource description and discovery. Basic metadata about a document are its title, its author, its date of publication, its keywords and its description [2]. Although manual annotations are considered the main source of information for the Semantic Web, the majority of existing HTML pages are still poorly equipped with any kind of metadata. Hence automatic metadata extraction is an attractive alternative for building the Semantic Web. The three main existing methods to generate metadata automatically are [14]:
- Deriving metadata: creating metadata based on system properties.
- Harvesting metadata: gathering existing metadata, e.g., META tags found in the header source code of an HTML resource.
- Extracting metadata: pulling metadata from resource content, which may employ sophisticated indexing and classification algorithms.
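The harvesting strategy can be illustrated with a short sketch that pulls the author value out of a page's META tags. This is our own minimal example, not part of any system described in the paper; the helper names and the HTML snippet are illustrative:

```python
from html.parser import HTMLParser

class MetaAuthorParser(HTMLParser):
    """Collects the content of a <meta name="author" ...> tag, if any."""
    def __init__(self):
        super().__init__()
        self.author = None

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)  # attribute names are lower-cased by HTMLParser
            if d.get("name", "").lower() == "author":
                self.author = d.get("content")

def harvest_meta_author(html: str):
    parser = MetaAuthorParser()
    parser.feed(html)
    return parser.author

page = '<html><head><meta name="Author" content="Jane Doe"></head><body>...</body></html>'
print(harvest_meta_author(page))  # Jane Doe
```

As section 1 notes, this strategy only works when authors actually fill the META field, which is rarely the case in practice.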
This paper focuses on automatic author extraction from HTML documents as part of a more global application on automatic metadata extraction from learning resources. The author of a resource, who is responsible for its creation, can be a person or an organization. It allows users to judge the credibility of the resource content [3] and can also serve as a searchable record for browsing digital libraries: a user can look for a course by choosing the professor's name as a query. Hence, automatically annotating the author field can be of great interest to help find appropriate information. In HTML documents, people can explicitly specify the author in the META tag; however, they seldom do so carefully. We evaluated the author field on our dataset, which contains 354 HTML pages, and found that only 15% of the META author fields are filled; an alternative method should therefore be adopted for author extraction. This paper proposes a machine learning technique for automatic author extraction from Web documents based on the Decision Tree (C4.5) algorithm. For each HTML page, person names are extracted and features are generated based on spatial and contextual information. Features corresponding to the same person are then combined in a disjunctive manner; this combination considerably improves the extraction results. The rest of the paper is organized as follows. In section 2, previous related works are described, and in section 3, we give specifications on the Web page author. Section 4 describes our author extraction method as well as the features construction method. Section 5 presents the experimental results. We make concluding remarks in section 6.
2 Related work
Web information extraction has become a popular research area and many issues have been intensively investigated. Automatic extraction of Web information has many applications, such as cell phone and PDA browsing [4], automatic annotation of Web pages with semantic information [5] and text summarization [6]. There are two main approaches to Web information extraction (IE): the rule-based approach [7], which induces a set of rules from a training set, and the machine learning based approach, which learns statistical models or classifiers. The machine learning based approach is more widely employed, and systems differ from each other mainly in the features that they use: some use only basic features such as token string, capitalization and token type (word, number, etc.) [8], while others use linguistic features such as part-of-speech tags, semantic information from gazetteer lists and the outputs of other IE systems (e.g., named entity recognizers) [9, 10]. In [11], the authors proposed a machine learning method for title extraction from HTML pages. They utilize format information such as font size, position, and font weight as features for title extraction. While most Web pages have their titles placed at the beginning of the document with a conspicuous color and size, the author of a page does not have special visual properties, which makes the visual method unsuitable for author extraction.

To the best of our knowledge, the only work on author extraction from HTML documents is that of the authors of [3], who proposed a method for ranking author name candidates in Web pages in Japanese. They use features derived from the document structure as well as linguistic knowledge, and rank candidates using the Ranking SVM model. As their approach relies on the distance from the main content, the method fails when the author name occurs inside the main content. Our approach resolves this problem by merging the features of the different occurrences of a person name in a page into a sole and representative one. This merging of features remarkably improves the extraction results.

A well-known drawback of supervised machine learning methods is the manual annotation of the input dataset. In this paper, a semi-automatic annotation method is adopted. It iteratively expands the corpus by extracting the authors from new HTML pages, using a learning model constructed on a few manually annotated pages; the human annotator's main task consists of verifying the suggested annotations.
3 Web page author
This paper focuses on the problem of automatically extracting authors from the bodies of HTML documents, assuming that the extraction is independent of the Web page structure. In this paper, the Web page author is considered to be the person responsible for the content of the document. Authors share some common characteristics:
- The author of an HTML document can be composed of a first name and/or a last name.
- It is generally placed at the beginning of the page or after the main content of the document.
- Next to the author name, we can find a mention of the document creation date, the author's email address, and even the name of the organization he or she belongs to.
- Some vocabulary can help in recognizing the author's name, like "author, created by, written by, etc."; a list of such words of interest was constructed.
For example, in the pages shown in figure 1, "Richard Fitzpatrick" is the author of page A (http://farside.ph.utexas.edu/teaching/em/lectures/node54.html) and "Jason W. Hinson" is the author of page B (http://www.physicsguy.com/ftl/html/FTL intro.html). We assume that there is only one author per Web page.
Fig. 1. Examples of Web pages authors
4 Author extraction method
In this paper, a machine learning approach is conducted to address the problem of extracting authors from Web documents. Before the training phase, a preprocessing phase is needed to prepare the input data. The global schema of the features construction is illustrated in figure 2.

4.1 HTML page parsing
In order to analyze an HTML page for content extraction, it is first transformed into a well-formed XML document using the open-source HTML syntax checker JTidy (http://jtidy.sourceforge.net/). The resulting document is then parsed using the Cobra toolkit (http://lobobrowser.org/cobra.jsp), which creates its Document Object Model (DOM, www.w3.org/DOM) tree representation. We consider the HTML element as the root node of the tree. Our content extractor navigates the DOM tree and gets the text content from the leaf nodes.

4.2 Person names extraction

Person names (PNs) are extracted from the textual nodes of the DOM tree. For this purpose, some named entity recognition systems such as Balie (baseline information extraction, http://balie.sourceforge.net/) and LingPipe [12] were tried first. While these systems give good results when trained and applied to a particular genre of text, they make many more errors on the heterogeneous text found on the Web. Instead, our method is based on the gazetteer approach and extracts PNs from Web pages using the list of frequently occurring US first names (http://www.census.gov/genealogy/names/names files.html). Using this list
Fig. 2. Features construction
results in a simpler algorithm and allows extracting names more accurately than sophisticated named entity extraction systems. We use only the US first names list because the US last names list contains common words like White, Young, Price, etc., which could cause common words to be extracted as PNs and hence generate labeling noise. We then created regular expressions based on capitalization to extract the full name starting from the first name.

4.3 Context window extraction
Since in IE the context of a word is usually as important as the word itself, our approach takes into account the words neighboring each person name occurrence. Each PN is considered with 15 words on either side as a "window of context" around the candidate object. The size 15 was chosen experimentally (see section 5.3 for more details). However, in an HTML document the context window does not depend only on the number of tokens but also on the page layout and structure: the text of an HTML document is generally divided into different visual blocks. If a PN is situated in a given block of the page, its context window should not contain tokens from other blocks. In the example in figure 3 (http://www.cse.unr.edu/ mircea/Teaching/cpe201/), the context window of the author name "Mircea NICOLESCU" is composed of the highlighted tokens; the window should contain only words that are in the same block as the PN, hence the left window of this example contains only the phrase "Created by".

Fig. 3. Context window extraction

In this paper, a DOM-based method was adopted to construct the contextual window. Our approach exploits the tags generally used to delimit blocks in HTML pages, such as HR, TABLE, H1, P, and DIV. They are used to refine the context window as follows: for the left window, text nodes situated before one of these tags are excluded; likewise, for the right window, nodes situated after one of these tags are not taken into account. Hence, for each PN occurrence, its visual block is detected and its context window is constructed. The context window will then be used to extract the contextual information required for the construction of the features.

4.4 Features construction
• Spatial information: The author name is generally placed either before the main content of the page or in its footer. The position of the PN relative to the page (PR) can therefore be an important clue for extracting the author. We define PR = position / maxPageDepth, where position is the position of the current node in the DOM tree and maxPageDepth is the total number of textual nodes in the tree. Two thresholds were fixed experimentally: a beginning threshold equal to 0.2 and an end threshold equal to 0.9. This paper assumes that when PR is below 0.2 the text is situated at the beginning of the page, when it is above 0.9 it is placed at the end of the page, and when it is between the two thresholds the text is located in the main content of the page. The principal issue is how to fix both thresholds; this was done by experimenting with different threshold values and retaining those which give the best results (more details are given in section 5.3).
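The spatial feature just described can be sketched in a few lines. The function names are ours, and the thresholds 0.2 and 0.9 are the experimentally fixed values reported above:

```python
def relative_position(position: int, max_page_depth: int) -> float:
    # PR = position / maxPageDepth: the text node's rank among the page's text nodes.
    return position / max_page_depth

def position_zone(pr: float, begin_thr: float = 0.2, end_thr: float = 0.9) -> str:
    # The three zones used by the spatial features.
    if pr < begin_thr:
        return "beginning"
    if pr > end_thr:
        return "end"
    return "main content"

print(position_zone(relative_position(3, 40)))   # beginning
print(position_zone(relative_position(38, 40)))  # end
```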
• Contextual information: To get contextual information related to each PN in an HTML document, features are extracted from the context window based on the following information:
- Date: special regular expressions were created to detect whether there is a date in the context window.
- Email: regular expressions are created for email detection; moreover, hyperlinks are also exploited through 'mailto' links.
- Organization: the named entity recognition system Balie was applied to detect occurrences of organization entities in the context window.
- Vocabulary: two features indicating the existence of words from the author gazetteer were also constructed. The author gazetteer contains two lists: the first includes words like "author, creator, writer, contact...", and the second contains verbs such as "created, realized, founded...". These words are semantically interesting for Web page author recognition.
- An additional feature was created to point out the existence of the preposition "by" preceding the PN. This feature is kept apart since in some pages it can be the only information provided next to the author.
Nine binary features are created for each PN: 3 spatial features and 6 contextual features.

4.5 Merging features
For each PN in a Web document, a feature vector is created. One of the problems encountered is that the author name can occur more than once in the document, and each occurrence encloses more or less rich contextual information. The authors of [3] proposed a ranking method, giving a rank to each author name candidate. As our aim is to extract the author from a Web page and not to rank its occurrences, the solution proposed in this paper consists of merging the feature vectors using a disjunction (the OR operator). An example is given in figure 4: suppose that "John Smith" is the author of a Web page and that his name occurs 3 times in the page; we then have three feature vectors for this candidate, V1 = [1,0,1,0,1,0,0,0,1,1], V2 = [0,1,0,0,0,0,1,0,0,1], and V3 = [1,0,1,0,0,0,0,0,0,1]. The key idea is to construct a feature vector V representing all the occurrences of "John Smith" in the page; V is the disjunction of the three vectors: V = [V1 OR V2 OR V3] = [1,1,1,0,1,0,1,0,1,1]. This method gives richer and more powerful information for each person name in a page; moreover, it eliminates poor examples that can affect the training model. Section 5.3 shows the effect of merging features on the extraction results.
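The disjunctive merge can be sketched directly on the example vectors above:

```python
def merge_features(vectors):
    # Disjunctive (OR) combination, component by component.
    return [int(any(column)) for column in zip(*vectors)]

# The three occurrences of "John Smith" from the example in the text:
V1 = [1, 0, 1, 0, 1, 0, 0, 0, 1, 1]
V2 = [0, 1, 0, 0, 0, 0, 1, 0, 0, 1]
V3 = [1, 0, 1, 0, 0, 0, 0, 0, 0, 1]
print(merge_features([V1, V2, V3]))  # [1, 1, 1, 0, 1, 0, 1, 0, 1, 1]
```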
5 Algorithms and evaluations

5.1 Algorithms
This paper uses a supervised learning technique to extract authors from HTML documents. The algorithm learns a model from a set of examples: let {(x1, y1), ..., (xn, yn)} be a two-class training dataset, with xi the feature vector constructed for
Fig. 4. Features merging
each person name and yi its label (1 for the class 'author' and -1 for the class 'non author'). A PN is labeled as author if it approximately matches the annotated page author; the approximate match holds when at least 70% of the tokens in the annotated author are contained in the PN. We used supervised learning methods implemented in Weka [13] to train our classifier. Through experimentation, we found that the Decision Tree implementation (C4.5) provided the best classification performance. As baseline methods, we used the extraction of the author from the META author tag (MetaAuthor) and the OneR (one rule) classifier. The OneR model generates classification rules using a single attribute only; we use the one-rule classifier implementation provided in the Weka toolkit. To evaluate the author extraction method, precision, recall and F1-measure are used as evaluation metrics.

5.2 Data
Data was first collected manually by sending queries to the Web; among the resulting pages, a human annotator selected those which contain their authors in their contents. As human annotation is time consuming and, in this case, requires looking through numerous Web pages to find a few interesting ones, annotation was stopped when 100 annotated pages were obtained. However, a dataset of 100 pages is not representative enough, especially as the accuracy of a learned model usually increases with the number of training examples. In order to expand the dataset, we adopted a semi-automatic technique that can be explained as follows:
- A Decision Tree model is first trained on the features extracted from the existing annotated pages.
- New pages are acquired by sending different queries to a Web search engine using the Google API.
- PNs are extracted from the resulting pages and their feature vectors are constructed as explained in section 4.
- The learning model is then applied to these features, and pages that contain instances classified as 'author' are retained. These pages are then labeled by the human annotator. This phase relies heavily on the system's suggestions: the annotator's main task is correcting and integrating the suggested annotations, and only a small number of pages now need to be parsed.
- Features created from the new annotated pages are added to the previous training examples.
The process is repeated several times, and each time different queries are chosen in order to get different HTML pages. Our corpus thus grew from 100 to 354 pages quickly and with little human effort. Within the context of our global application of automatic metadata extraction from learning resources, we are especially interested in extracting information from the education domain; thus, the query words belong to the education lexicon, e.g., analog electronics, molecular biology, operating systems, human sciences, etc. Even though the new pages are obtained with the help of a model trained on already annotated ones, our corpus contains heterogeneous HTML documents which are content and structure independent.

5.3 Experiments
This section summarizes the results of the different experiments on author extraction. In the experiments, we conducted 10-fold cross validation, and thus all the results are averaged over 10 trials. The input examples of our experiments are the binary feature vectors related to the person names found in the annotated HTML documents. Each example is labeled with the class 'author' or 'non author'. The results are summarized in table 1 and indicate that our method significantly outperforms the baseline methods. Our learning-based method makes effective use of various types of information for author extraction. The MetaAuthor method can correctly extract only about 15% of the authors from the META tags of the HTML pages.

Table 1. Performances of baseline methods for author extraction

Method     | Precision | Recall | F1-measure
C4.5       | 0.812     | 0.805  | 0.809
OneR       | 0.704     | 0.61   | 0.654
MetaAuthor | 1         | 0.149  | 0.253
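The reported metrics are the standard IE ones; the following sketch shows how they relate to raw classification counts. The counts below are hypothetical, chosen only so that the resulting scores land near the C4.5 row of Table 1, since the paper reports only the final scores:

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Standard IE evaluation metrics computed from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts (354 true authors, one per page) consistent with
# the C4.5 row of Table 1; the paper itself reports only the final scores.
p, r, f = precision_recall_f1(tp=285, fp=66, fn=69)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.812 0.805 0.809
```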
Effectiveness of merging features: Table 2 shows the results obtained before and after applying the disjunctive combination of features for the different PN candidates.

Table 2. Evaluation results before and after features combination (Positive = the number of examples labeled 'author', Negative = the number of examples labeled 'non author').

Method                | Precision | Recall | F1-measure | Positive | Negative
Non combined features | 0.782     | 0.604  | 0.681      | 533      | 1961
Combined features     | 0.812     | 0.805  | 0.809      | 354      | 1911
The results indicate that combining the features notably enhances the results; in particular, the recall improved by about 21%. This can be explained by the fact that recall is affected by the number of items incorrectly classified as 'non author'. Without merging the features, this number is high, since a page can contain more than one occurrence of the author name, and some candidates have poor contextual information, which causes them to be incorrectly classified as 'non author' by the model.

Parameters effectiveness: Figure 5 shows the experimental results in terms of F1-measure obtained with different parameters. Curve C shows the results while changing the size of the context window: with a small window size we can miss some useful information, and larger sizes can induce more noise in the dataset. A window size of 15 seems to enclose the relevant context information and to give the best results. Curve A shows how the training model results change while varying the beginning threshold with the end threshold fixed to 0.9; in curve B, we fix the beginning threshold to 0.2 and vary the end threshold. These curves support the choice of 0.2 and 0.9 as beginning and end thresholds to delimit the main content of a Web page: both values give the best results in terms of F1-measure.

Dataset size effectiveness: During the dataset expansion, the performance evolution of our system was evaluated. Curve D in figure 5 summarizes the results and shows that the performance of the model improves as the number of annotated HTML pages increases.

Feature contribution: We further investigated the contribution of each feature type to author extraction. Experiments were conducted using each category of features separately (the spatial features and the contextual features). Table 3 summarizes the results.
Fig. 5. Parameters effectiveness (A: beginning threshold; B: end threshold; C: context window size; D: dataset size)
The results indicate that one type of features alone is not sufficient for accurate author extraction. With the spatial features we obtain better precision, whereas with the contextual features we get better recall. Position information alone is insufficient for extracting all the authors from the HTML pages; the contextual information makes the result more complete.

Table 3. Contribution of each feature type

Feature subset      | Precision | Recall | F1-measure
Spatial features    | 0.828     | 0.339  | 0.481
Contextual features | 0.743     | 0.636  | 0.685

6 Conclusion
This paper provides a new approach to automatically extract the author from heterogeneous Web documents. The author is an essential component for judging the credibility of a resource's content. To address the problem, our method uses a machine learning approach based on the HTML structure as well as on contextual information. The method adopted in this paper extracts the author name from the body of the HTML document; if this information is absent from the content of the page, other methods should be adopted, such as the stylometry approach, which is often used to attribute authorship to anonymous documents. Future directions include discovering other metadata fields from HTML pages so as to enrich resources and make them more accessible.
References

1. Berners-Lee, T., Hendler, J., Lassila, O.: The Semantic Web. Scientific American (2001)
2. Romero, C., Ventura, S.: Educational data mining: A survey from 1995 to 2005. Expert Syst. Appl. 33 (2007) 135-146
3. Kato, Y., Kawahara, D., Inui, K., Kurohashi, S., Shibata, T.: Extracting the author of web pages. In: WICOW '08: Proceedings of the 2nd ACM Workshop on Information Credibility on the Web, New York, NY, USA, ACM (2008) 35-42
4. Gupta, S., Kaiser, G., Neistadt, D., Grimm, P.: DOM-based content extraction of HTML documents. In: WWW '03: Proceedings of the 12th International Conference on World Wide Web, New York, NY, USA, ACM (2003) 207-214
5. Etzioni, O., Cafarella, M., Downey, D., Popescu, A.M., Shaked, T., Soderland, S., Weld, D.S., Yates, A.: Unsupervised named-entity extraction from the web: an experimental study. Artif. Intell. 165 (2005) 91-134
6. Evans, D., Klavans, J.L., McKeown, K.R.: Columbia Newsblaster: Multilingual news summarization on the web. In: Proceedings of the Human Language Technology Conference of the North American Chapter of the ACL. (2004)
7. Ciravegna, F.: (LP)2, an adaptive algorithm for information extraction from web-related texts. In: Proceedings of the IJCAI-2001 Workshop on Adaptive Text Extraction and Mining. (2001)
8. Freitag, D., Kushmerick, N.: Boosted wrapper induction. In: Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence, AAAI Press / The MIT Press (2000) 577-583
9. Amitay, E., Harel, N., Sivan, R., Soffer, A.: Web-a-where: geotagging web content. In: SIGIR '04: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY, USA, ACM Press (2004) 273-280
10. Nadeau, D., Turney, P., Matwin, S.: Unsupervised named-entity recognition: Generating gazetteers and resolving ambiguity. (2006) 266-277
11. Changuel, S., Labroche, N., Bouchon-Meunier, B.: A general learning method for automatic title extraction from HTML pages. In: International Conference on Machine Learning and Data Mining (MLDM '09), Leipzig, Germany. (2009)
12. Alias-i: LingPipe Natural Language Toolkit (2006). http://www.aliasi.com/lingpipe
13. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques (Second Edition). Morgan Kaufmann (2005)
14. Greenberg, J., Spurgin, K., Crystal, A.: Functionalities for automatic metadata generation applications: a survey of metadata experts' opinions. Int. J. Metadata Semant. Ontologies 1 (2006) 3-20