Generalized and Lightweight Algorithms for Automated Web Forum Content Extraction

Wee-Yong Lim¹, Vyjayanthi Raja² and Vrizlynn L. L. Thing¹
Cybercrime and Security Intelligence Department, Institute for Infocomm Research
1 Fusionopolis Way, 138632, Singapore
Email: {weylim, vriz}@i2r.a-star.edu.sg, [email protected]

Abstract—As online forums contain a vast amount of information that can aid in the early detection of fraud and extremist activities, accurate and efficient information extraction from forum sites is very important. In this paper, we discuss the limitations of existing works on the extraction of information from generic web sites and forum sites, and identify the need for better suited, generalized and lightweight algorithms that carry out more accurate and efficient information extraction while eliminating noisy data from forum sites. We propose three generalized and lightweight algorithms for accurate thread and post content extraction from web forums, and evaluate them against two strict criteria, measured at the granularity of individual (DOM tree) nodes. We consider a thread or post as successfully extracted by our algorithms only if (i) all the contents in its text and anchor nodes are extracted correctly, and (ii) each content node is grouped correctly according to its respective thread or post. Our experiments on ten different forum sites show that our proposed thread extraction algorithm achieves an average recall and precision of 100% and 98.66%, respectively, while our core post extraction algorithm achieves an average recall and precision of 99.74% and 99.79%, respectively.
Index Terms—Online forums, information retrieval, content extraction, web intelligence.

I. INTRODUCTION

The widespread use and contribution of knowledge in the form of data uploaded to the Internet have made it a rich information source for any conceivable topic. Online forums are among the most important platforms on the Web. Their dynamically expanding content, contributed by Internet users on a daily basis, makes them increasingly rich in information, and their popularity stems from facilitating global, convenient, fast and freely open discussions. As web forum data is an accumulation of a vast collection of continually updated human knowledge and viewpoints, it can be a highly valuable source of online information for knowledge acquisition, whether to build up domain expertise [1], improve business intelligence [2], or support the early detection of extremist activities [3], [4]. Regardless of the application, the fundamental step in forum data mining is to fetch and extract valuable data from the various forum sites distributed on the Internet in an efficient manner, where efficiency is measured by the retrieval of valuable data while avoiding noise such as irrelevant links and advertisements.
One of the earliest techniques of generic web crawling is the breadth-first crawling approach [5]. The crawler starts with an

initial queue of URLs, dequeues each URL and downloads the corresponding page. Each downloaded page is then parsed to extract any outlinks, which are added to the end of the queue for processing. The process continues until a preset maximum number of URLs is reached. The main limitation of the breadth-first crawling approach, however, is its storage space constraint. In [6], a topic-specific crawler is proposed. The basis of this work is that two neighbouring pages are semantically related, a conclusion drawn by the authors in [5]. The crawler employs a human-assisted approach to learn a particular topic and identify documents related to it. [6] is therefore one of the earliest works describing a focused and targeted approach to crawling. Although generic web crawlers are effective for applications such as search engines and topic-based web information retrieval, they are not suitable for forum crawling. This is due to the unique structure of forum sites and the associated human discussion behavior, which is evident in the dynamic nature of forum contents and their presentation. [7]–[11] describe novel approaches to forum crawling in which the forum structure is learned dynamically by the crawler through forum page sampling. The different types of forum pages are detected and the site traversal paths are derived by the forum crawlers; the relevant pages are then downloaded from the forum sites in the operational mode. The main problem with the downloaded pages is the presence of noise, as forum pages are commonly populated with irrelevant data and links. These links may point to external sites and advertisements, which are not necessary for the purpose of data analysis. Therefore, an important step is to retrieve valuable data from the downloaded pages by extracting the relevant forum tables and lists while avoiding the noise.
In this paper, we first discuss the challenges and shortcomings of existing approaches in the area of relevant information extraction from forum pages. Next, we propose three lightweight algorithms to carry out automated information retrieval from forum sites in a generic and scalable manner. We evaluate the performance of our algorithms by carrying out experiments on ten forum sites. Our proposed algorithms are shown to achieve excellent results in significantly reducing noise during the information extraction phase in an automatic manner.
The rest of the paper is organized as follows. In Section II, we discuss the existing work in the field of both web site and forum site data extraction. We present a brief overview

Fig. 1. Table Structure of Board/Thread List

Fig. 2. Table Headers and Column Headers

Fig. 3. Mismatched styling attributes to aid in identifying outliers

of forum sites and our observations in Section III. Our proposed algorithms are presented and discussed in Section IV. Experimental results are presented and analysed in Section V. Conclusions follow in Section VI.

II. RELATED WORK

Several works exist that attempt to extract data from web pages and forums. The first approach is wrapper based [12]–[15], and uses supervised machine learning to learn data extraction rules from positive and negative samples. The structural information of the sample web pages is utilized to classify similar data records according to their subtree structure. However, this approach is inflexible and non-scalable due to its dependency on site templates. Manual labeling of the sample pages is also extremely labor intensive and time consuming, and has to be repeated even for pages within the same site

due to varying intra-site templates. Another approach [16], [17] relies on the visual attributes of the web page. The web pages are rendered by a web browser to ascertain the data structure on different pages. Features such as the positioning of the information units, the cell sizes, the font characteristics and the font colors are analysed to infer the semantic meaning of the contents, based on the assumption that content rendering by the browser ensures a human-understandable output format. Data structure inference from these visual attributes is reported to achieve a precision of 81% and a recall of 68%. In [18], the authors identify different data blocks based on differences in their visual styles, such as the width, height, background color and font. The disadvantage of the visual attribute based approach is the need to render the web pages during both the learning and extraction phases,

TABLE I
DEFINITION OF TABLES AND THEIR CORRESPONDING ROWS AND COLUMNS

Table: <table> (including its <thead>, <tbody> and <tfoot> sections)
Row: <tr>
Column: <td>, <th>

Table: <ul>, <ol>
Row: <li>
Column: A sequence (set of siblings) of two or more container elements at any depth within the row element. If no such sequence is present, the immediate children of the row element constitute the only column in this table. Exception: If a table element is the first and only child of the row element, it is selected as the only column and the start of a new table.

Table: <dl>
Row: A combination of one <dt> and the <dd> elements that follow it
Column: The same sequence-of-containers rule as above.

Table: <div> with the id attribute set
Row: A sequence (set of siblings) of two or more container elements at any depth within the <div> element. If no such sequence is present, the immediate children of the <div> element constitute the only row in this table. Exception: If a table element is the first and only child of the <div> element, it is selected as the only row and the start of a new table.
Column: The same sequence-of-containers rule as above.
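To illustrate, here is a minimal sketch of the "first sequence of two or more sibling containers" row rule for <div>-based tables, assuming lxml-style elements; the container set is an illustrative subset of the classification given in Section IV:

```python
CONTAINERS = {"div", "table", "ul", "ol", "li", "dl", "dt", "dd",
              "tr", "td", "th"}          # illustrative subset (see Section IV)

def find_rows(div_el):
    """Row rule for <div id=...> tables (Table I): the first sequence of
    two or more sibling container elements, searched breadth-first; if
    none exists, the immediate children form the only row."""
    queue = [list(div_el)]
    while queue:
        siblings = queue.pop(0)
        containers = [c for c in siblings if c.tag in CONTAINERS]
        if len(containers) >= 2:
            return containers            # each container is one row
        queue.extend(list(c) for c in siblings if len(c))
    return [div_el]                      # single row: the immediate children
```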

which is computationally expensive. Probabilistic model approaches take into consideration the semantic information in web sites and generate models to aid in the data extraction. In [19], the authors observed certain strongly linked sequence characteristics among web objects of the same type across different web sites. They presented a two-dimensional conditional random fields model to incorporate the two-dimensional neighborhood interactions of the web objects, so as to detect and extract product information from web pages. In [17], the same authors presented a hierarchical conditional random fields model that couples the data record detection and attribute labeling phases, so that detection can benefit from the semantics available during attribute labeling. However, these two approaches assume an adequate availability of semantic information specific to the domain type of the web sites (e.g. product information sites in this case). In [20], the authors proposed parsing the web page to form a tag tree based on the start and end tags. The primary content region is located by extracting the minimal subtree which contains all the objects of interest. Three features, namely the fanout, content size and tag count, are relied upon to choose the correct subtree. However, the evaluation of [20] in [21] shows that the proposed methods could not obtain good results in extracting data from web pages. In [21], the authors proposed an algorithm that only considers nodes with a tree depth of at least three (derived from the observation of web page contents), and extracts the data region based on nodes with a high string-based similarity [22]. However, the proposed algorithm requires a training phase to derive the string-based similarity threshold, which is specific to different web sites. In [23], the authors rely on text density and composite text density measurements to support content extraction. Text density is the ratio of the number of characters to the number


of tags in a region. The basis for using text density is that the text regions in a page generally contain a higher text ratio than the other regions. Composite text density additionally takes into consideration the noise due to hyperlinks, and gives a high score to content containing a low number of hyperlinks (indicating a region of interest). However, this method is not applicable to board and list-of-thread page content extraction. In addition, the achievable granularity of content extraction from post pages does not reach the level of individual posts, which is desired in forum data extraction.
In [24], the authors proposed generating a sitemap, a directed graph representing the forum site, by first sampling 2000 pages. The vertices of the graph represent pages, while the edges denote the links between pages. From the sampled pages, the authors then proposed extracting three features: i) an inner-page feature that captures characteristics such as the presence of time information, whether the elements on a page are aligned (determined by rendering the page in a web browser to identify the elements' locations), and whether the time information follows a special order (i.e. to identify post records by their sequential post times); ii) an inter-vertex feature that captures site-level knowledge, such as whether a vertex leads to another vertex with post pages and whether their joining edge is defined as a post link; and iii) an inner-vertex feature that captures the alignment of nodes within a vertex, such as whether they share a similar DOM path and tag attributes. Based on the extracted features, Markov Logic Network models are generated for each forum site. However, manual labelling of the sample pages is required in the training phase, which is extremely labor intensive and time consuming. In addition, the sitemap construction and feature extraction processes have to be carried out during the operational phase and

therefore create a bottleneck during information extraction. In [25], the authors proposed using the properties of the links present in forum pages to design a more generic algorithm for extracting content from forums. During the training phase, the signature and common keyword for the reference links are identified automatically, in order to find the dominant XPath of the reference links and page-flipping links for each type of forum page. Subsequently, the best XPath for differentiating between different content regions in the pages is derived, enabling the contents to be extracted in appropriate discrete units during forum content extraction. Experiments on 9 full forum sites showed recall rates of 99.88% to 100% and precision rates of 99.99% to 100% for content extraction from thread pages, and recall rates of 97.71% to 100% and precision rates of 99.9% to 100% for post pages.

III. FORUM SITE OBSERVATIONS

In general, there are three main types of pages that contain relevant information in forum sites: the board page, the thread page and the post page.
Board Page contains a list of board records, where each board record contains a link to a thread page and some other information relating to the specific board, such as the number of threads, the number of posts and the most recent posts. Usually, the homepage of a forum site is the board page.
Thread Page contains a list of thread records, where each thread record contains a link to the first page of a discussion thread (i.e. a post page) and other information relating to the specific thread, such as the author of the latest post, the time of the last post and the number of views.
Post Page contains a list of post records, where each post record contains the content of the post (e.g. text, links, images), the name of the author of the post, the time of posting, and additional details about the author, such as the author's total number of posts and date of joining.

Observations

We observe that the board and thread lists on their respective pages are typically organised in a tabular format, where each row of the table represents an individual board or thread record, and each column provides specific details about the record (as shown in Figure 1). We also observe that the board and thread tables are typically preceded by headers (either a header for the entire table, or a header for each column). Thus, we define two types of headers encountered on board and thread pages, namely table headers and column headers (as shown in Figure 2). To retrieve information within a table or from a specific column (to support further granularity of extraction), the retrieval algorithm should support both cases when supplied with a table header or a column header. A detailed inspection of the HTML source of forum sites also reveals that the board and thread lists are typically encapsulated in HTML table or list structures.
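To make this observation concrete, here is a minimal hypothetical thread-list fragment in the style of Figures 1 and 2, parsed with lxml; the titles, authors and counts are invented:

```python
from lxml import html

SNIPPET = """
<table>
  <tr><th>Thread</th><th>Author</th><th>Replies</th><th>Views</th></tr>
  <tr><td><a href="t1.html">First thread</a></td><td>alice</td><td>12</td><td>340</td></tr>
  <tr><td><a href="t2.html">Second thread</a></td><td>bob</td><td>3</td><td>98</td></tr>
</table>
"""

table = html.fromstring(SNIPPET)
rows = table.findall(".//tr")
print([th.text for th in rows[0]])              # the column headers
print([td.text_content() for td in rows[1]])    # one thread record per row
```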

The defined header types also form either the first row of the board or thread list, or are present in a sibling table immediately above it. Post pages are not structured in the same way as board and thread pages: the post lists are not organised in a tabular format with well-defined headers. However, the contents of a single post record are typically encapsulated within a single entity of enclosing tags, and all the post enclosing tags are sibling nodes. Nonetheless, post records are usually interleaved with advertisements and other noisy records. An example HTML snippet is shown in Figure 3, where three of the four records are post records and their enclosing tags contain the same attribute and value sets; the second record, with a different styling attribute in its enclosing tag, is the outlier. To eliminate such noisy information, we propose to verify that the extracted post records are aligned and have the same style features during post record extraction, and to eliminate the contents that do not fulfil this requirement (a sketch of this check follows). In addition, another observation that is striking and distinctive for post records is that posts are typically messages from users in the context of a discussion, and they generally encompass the bulk of the area on a post page. Therefore, this factor can be taken into consideration to extract the post records from post pages.
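A minimal sketch of this outlier-elimination step, assuming lxml-style element objects (`.tag`, `.attrib`); the helper names are illustrative:

```python
from collections import Counter

def style_signature(node):
    """Tag name plus attribute/value set, ignoring 'id' (unique per element)."""
    attrs = tuple(sorted((k, v) for k, v in node.attrib.items() if k != "id"))
    return (node.tag, attrs)

def drop_style_outliers(records):
    """Keep only the records sharing the dominant style signature,
    discarding interleaved advertisements such as the outlier in Fig. 3."""
    sigs = [style_signature(r) for r in records]
    dominant, _ = Counter(sigs).most_common(1)[0]
    return [r for r, s in zip(records, sigs) if s == dominant]
```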

IV. PROPOSED ALGORITHMS

Based on our observations of forum sites, we propose a generalized and lightweight approach to carry out information extraction from board, thread and post pages. First, we define our classification of a comprehensive set of HTML tags. Our classification is based on an exhaustive study of the HTML 4.01 tag definitions in [26] and on our observations of forum sites. We classify the information and elements as follows.

Raw Information
The basic information unit (i.e. text and HTML links). Tags - text, <a>

Informative Elements
These elements are used to encase and provide styling attributes for raw information. It is assumed that they do not provide any organisational or structural attributes to the data. Tags - the text-level and styling elements of HTML 4.01, including <a>, <b>, <i>, <u>, <em>, <strong>, <font> and <h1> to <h6>
Note - The <a> tag is listed as raw information as well as an informative element. The reason is that, during data extraction based on DOM tree traversal, the link is extracted from the <a> tag and the traversal then continues so as to extract the associated text, if any is present.

Container Elements
These elements are used to group units of data together in a page and to provide the structural organisation of the page.

TABLE II
TABLE IDENTIFICATION AND EXTRACTION HEURISTICS FOR TABLE HEADER CONTEXT

1) Since the context is TABLE, the header is immediately assumed to be a table header.
2) If a table element immediately follows the header's table element, it is the first table to be extracted. Execution jumps to step 4.
3) If no such table element follows the header's table element, no results are reported.
4) The tag of this table element is noted. Also, the number of informative columns in the first row of this table is noted, where an informative column is defined as containing at least one child element that holds raw information, is informative or is a container. Let this number be n.
5) All the rows of this first table are extracted, unless a row whose column count is not equal to n is encountered, in which case extraction terminates.
6) If extraction does not terminate at step 5, then for every subsequent table element that follows the first table and whose tag is the same as that of the first, all rows with n columns are extracted. Extraction terminates when the first non-conforming table is encountered, or when a row in a table does not contain n columns.

TABLE III
TABLE IDENTIFICATION AND EXTRACTION HEURISTICS FOR ROW HEADER CONTEXT

1) If the row element of this header has no siblings, it is a table header. Otherwise, it is a column header.
2) If the header is a column header, the position of the row amongst its siblings is noted. Let this position be p. Only informative rows are considered when calculating this position. Also, the total number of header rows n is noted.
3) If the table element of this row is immediately followed by another table element, that is the first table to be extracted. In the case of a column header, the contents of column p in each row of this table are extracted. Otherwise, the entire content of each row with n columns is extracted, unless a row without n columns is encountered. If extraction does not terminate, execution skips to step 5.
4) If no such table element follows this row's table element, no results are reported.
5) All tables that are subsequent siblings of this first table are selected for extraction if their table element is the same as that of the first one and their rows contain n columns. Extraction proceeds accordingly, depending on whether the header is a column or table header, and terminates when the first non-conforming table is encountered or, in the case of a table header, when a non-conforming row is encountered.

TABLE IV
TABLE IDENTIFICATION AND EXTRACTION HEURISTICS FOR COLUMN HEADER CONTEXT

1) If the column element of this header has no siblings, it is a table header. Otherwise, it is a column header.
2) If the header is a column header, the position of the column amongst its siblings is noted. Let this position be p. Only informative columns are considered when calculating this position. Also, the total number of header columns n is noted.
3) First, rows subsequent to this column's row element are searched for. If they exist, they are assumed to be the rows of the table corresponding to this header. Depending on whether the header is a column or table header, data is extracted accordingly, after which extraction terminates.
4) If no subsequent rows are detected, a table element that immediately follows this column's table is searched for. If it exists, data is extracted accordingly, depending on whether the header is a column or table header. If extraction does not terminate, execution skips to step 6.
5) If no such table exists, no results are reported.
6) All tables that are subsequent siblings of this first table are selected for extraction if their table element is the same as that of the first one and their rows contain n columns. Extraction proceeds accordingly, depending on whether the header is a column or table header, and terminates when the first non-conforming table is encountered or, in the case of a table header, when a non-conforming row is encountered.
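For the simplest of these three cases, the table-header context of Table II, a minimal sketch follows; the `rows` and `columns` helpers are simplified stand-ins for the Table I rules, and the informative-column test is omitted:

```python
def rows(table_el):
    # Placeholder for the Table I row rules (here: <tr> rows only).
    return table_el.findall(".//tr")

def columns(row_el):
    return [c for c in row_el if c.tag in ("td", "th")]

def extract_table_header_context(header_table, following_tables):
    """Steps 2-6 of Table II: extract rows with a consistent column
    count n from the tables following the header's table element."""
    if not following_tables:
        return []                          # step 3: no table follows
    first = following_tables[0]            # step 2
    first_rows = rows(first)
    if not first_rows:
        return []
    n = len(columns(first_rows[0]))        # step 4 (simplified)
    out = []
    for table in following_tables:         # steps 5 and 6
        if table.tag != first.tag:
            break                          # non-conforming table ends extraction
        for row in rows(table):
            if len(columns(row)) != n:
                return out                 # non-conforming row ends extraction
            out.append(row)
    return out
```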

Tags - including <div>, <table>, <thead>, <tbody>, <tfoot>, <tr>, <td>, <th>, <ul>, <ol>, <li>, <dl>, <dt> and <dd>

Table Elements
These elements fall into a special class of container elements that are used to group data on a webpage into a tabular or list format.
Tags - <table>, <ul>, <ol>, <dl>, <div> with the id attribute set
Note - The data within a <table> tag is often grouped into sections enclosed in <thead>, <tbody> and <tfoot> tags; each of them is considered a table element as well, and when one is encountered in the DOM tree it is treated as a <table> enclosing the affected nodes. In addition, a <div> element with its id attribute set can be used to indicate the start of a new section or block in a webpage or, in a forum page, the start of a board, thread or post list. In contrast, a <div> element without an id attribute set is only used to add styling attributes to the content it encloses.
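Expressed as data, the classification above might look like the following; the sets are an illustrative subset, since the exact membership follows the HTML 4.01 study described earlier:

```python
INFORMATIVE = {"a", "b", "i", "u", "em", "strong", "font",
               "h1", "h2", "h3", "h4", "h5", "h6"}
CONTAINER = {"div", "table", "thead", "tbody", "tfoot", "tr", "td", "th",
             "ul", "ol", "li", "dl", "dt", "dd"}
TABLE_ELEMENTS = {"table", "thead", "tbody", "tfoot", "ul", "ol", "dl"}

def is_table_element(node):
    """True if the node starts a table: a table element proper, or a
    <div> whose id attribute is set (see Table I)."""
    return node.tag in TABLE_ELEMENTS or (
        node.tag == "div" and node.get("id") is not None)
```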

Sibling Elements
HTML elements in a web page with the same XPath are termed sibling elements (or nodes).
Note - The <thead>, <tbody> and <tfoot> elements are assumed to possess the same XPath as their enclosing <table> elements. Therefore, we consider all <tr> tags within a single <table> tag to be siblings of each other, regardless of the section in which they appear.

All tags that are not defined above (e.g. <script>, <style>) are assumed to be "noise" and are ignored by the algorithms during extraction. We also define the concept of "context", which may be a table, row or column, to describe the exact type of the element (or node) being processed.

A. Definition of Table

We regard the presence of a table element tag as the start of a table in this work. However, in order to extract data from specific rows or columns of a table, we need to further devise definitions of what constitutes a table and its corresponding rows and columns, in the language of HTML tags. We present these definitions in Table I.
From Table I, it is clear that the row elements of most tables are defined according to the HTML 4.01 specifications. However, a <div> element is merely defined as "a section of the document" in HTML 4.01, so the demarcation of rows and columns within a <div> must be identified through other means. Considering the case where a <div> with an id attribute is in fact encasing the board or thread list, it seems logical that the individual records would be a sequence of container elements that are immediate children of the table element. However, in the source HTML, the path from the table element to the row elements is often interleaved with encasing container elements that are present only to add style or alignment attributes to the entire list. Therefore, the first sequence of container elements (with two or more elements in the sequence) is selected as the rows for this table. The only exception to this rule, as explained in Table I, is the case where another table element is the first and only child of the <div> with an id attribute. The column selection rules for list, definition-list and <div> (with an id attribute) rows follow the same reasoning, while the column elements defined for <tr> rows conform to the HTML 4.01 specifications.

B. Board and Thread Extraction Algorithm

Based on our observations of board and thread table structures in forum sites, we devise the following algorithm for extracting the relevant board or thread lists and columns; a sketch of the header-matching step follows the list.
1) Starting with the root node, traverse the web page DOM tree in a depth-first manner.
2) For each node, determine whether the text value of the node matches any of the input headers. If so, mark the node for data extraction and execute step 7.
3) If a table element is encountered, traverse the DOM subtree of this node with the context set to TABLE. Next, traverse the DOM subtrees of all the siblings of this table element with the TABLE context as well.
4) When the context is set to TABLE, discern the rows of the table using the heuristics in Table I and traverse each row's DOM subtree with the context set to ROW.
5) When the context is set to ROW, discern the columns of the row using the heuristics in Table I and traverse each column's DOM subtree with the context set to COLUMN.
6) Traversal ends when every node of the DOM tree has been visited (except for the "noise" nodes and their subtrees).
7) Depending on the context in which the header is encountered, locate the tables corresponding to this header and extract data according to Table II, III or IV, based on the nature of the header (i.e. table or column header).

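A minimal sketch of the header-matching part of this algorithm (steps 1, 2 and 6), assuming lxml-style elements; the file name and header strings are illustrative, and the containment test mirrors the sub-string matching discussed in Section V:

```python
from lxml import html

NOISE = {"script", "style"}   # assumed noise tags (see the classification above)

def find_header_nodes(root, input_headers):
    """Depth-first traversal that marks nodes whose text matches
    (here: contains) one of the supplied headers."""
    marked = []
    for node in root.iter():
        if not isinstance(node.tag, str) or node.tag in NOISE:
            continue   # skip comments and noise nodes (a full version
                       # would skip the noise subtrees entirely)
        text = (node.text or "").strip()
        if text and any(h in text for h in input_headers):
            marked.append(node)   # step 7 then extracts around these
    return marked

doc = html.parse("thread_page.html")   # hypothetical downloaded page
for n in find_header_nodes(doc.getroot(), ["Thread", "Replies", "Views"]):
    print(n.tag, (n.text or "").strip())
```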
C. Post Extraction Algorithms

Based on our observations of post lists in forum sites, we devise the following algorithm for extracting the relevant post lists and posts; a sketch follows at the end of this subsection.
1) Beginning with the root node, traverse the DOM tree in a depth-first manner.
2) When processing a single node, add its first child as the first element of a potential candidate set.
3) Assign each subsequent child to the most likely potential candidate set based on two criteria. The first criterion is that this subsequent child and the first node in the existing potential candidate set can be reached by the same XPath, implying that they are siblings. The second criterion is that this subsequent child and the first node in the existing potential candidate set have the same HTML tag and share the same attribute and value sets (except for the id attribute, which is required to be unique for each element on a page). Subsequent child nodes which do not fit into any existing set are placed into new set(s).
4) When all the children of the current node have been visited, add the potential candidate sets with more than one member to the final candidate set list. Nodes in these sets are not processed further. The remaining nodes are traversed and processed according to steps 2 to 4.
5) After the traversal terminates, compute the average text content for each final candidate set.
6) The candidate set with the highest average text content is selected as the post list, and the post data is extracted from each of its members.

In addition to the above core algorithm, we propose another post extraction algorithm that incorporates Tree Edit Distance (TED) filtering. With TED filtering, we ensure that the DOM subtrees under each element are similar before carrying out content extraction; non-similar elements are therefore not extracted (i.e. they are avoided during the extraction). We quantify the similarity between two trees using the Zhang-Shasha dynamic programming algorithm [27]. Next, we implement our algorithms and carry out experiments for performance evaluation.
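A minimal sketch of the core algorithm (steps 1 to 6), assuming lxml-style elements; the helper names are illustrative:

```python
def signature(node):
    """Grouping key: same tag and same attribute/value set, ignoring the
    id attribute. Children of one parent share an XPath prefix, so equal
    signatures satisfy both criteria of step 3."""
    attrs = tuple(sorted((k, v) for k, v in node.attrib.items() if k != "id"))
    return (node.tag, attrs)

def text_len(node):
    return len(" ".join(node.itertext()))

def find_post_list(root):
    """Steps 1-6: return the sibling set with the highest average text content."""
    final_sets, stack = [], [root]
    while stack:                                   # depth-first traversal
        node = stack.pop()
        groups = {}
        for child in node:                         # steps 2-3: group children
            if isinstance(child.tag, str):         # skip comments
                groups.setdefault(signature(child), []).append(child)
        for members in groups.values():            # step 4
            if len(members) > 1:
                final_sets.append(members)         # candidate set; not descended
            else:
                stack.extend(members)
    if not final_sets:
        return []
    return max(final_sets,                         # steps 5-6
               key=lambda s: sum(map(text_len, s)) / len(s))
```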

V. EXPERIMENT AND ANALYSIS

In this section, we describe the experiment setup used to evaluate the performance of our proposed algorithms. We implement our algorithms and the Zhang-Shasha algorithm, and utilize the libcurl, tidy and rapidxml libraries for downloading, cleaning up and parsing the web pages of forum sites, respectively. We retrieve 50 thread lists (i.e. pages) and 50 post tables from each forum, and consider each text and anchor node within the thread or post contents to be a positive node (i.e. a content node of interest). The rest of the nodes in the DOM tree are considered negative nodes (i.e. noise that we do not want to retrieve). As a strict evaluation criterion, a thread or post is regarded as extracted successfully by our algorithms (and thereby counted as a true positive) only if (i) all the contents in its text and anchor nodes are extracted correctly, and (ii) each content node is grouped correctly according to its respective thread or post. We show the results of our evaluation for thread extraction and for post extraction (without and with TED filtering) in Tables V, VI and VII, respectively.

TABLE V
THREAD EXTRACTION RESULTS

Forum             Total Nodes  Positive Nodes  Extracted Nodes  TP Nodes  Recall  Precision
eda                     82348           23236            23236     23236       1      1
scam                   107617           33200            34570     33200       1      0.960
ubuntu                  95729            8207             8207      8207       1      1
stormfront              94250           31995            34171     31995       1      0.936
creditcardforum         62606           15546            15746     15546       1      0.987
scamfound              150918           52562            53468     52562       1      0.983
realscam                65637           14696            14696     14696       1      1
islamicawakening       110053           19903            19903     19903       1      1
scambaits              105975           29749            29749     29749       1      1
exposeascam             76150           19259            19259     19259       1      1

TABLE VI
POST EXTRACTION RESULTS (LIGHTWEIGHT CORE ALGORITHM WITHOUT TED FILTERING)

Forum             Total Nodes  Positive Nodes  Extracted Nodes  TP Nodes  Recall  Precision
eda                     51055            8939             8939      8939       1      1
scam                    77384           15442            15442     15442       1      1
ubuntu                  66785            7842             7842      7842       1      1
stormfront              72896           12854            12854     12854       1      1
creditcardforum         40037            8938             8938      8938       1      1
scamfound               35945            3306             3290      3220   0.974      0.979
realscam               116743           34351            34351     34351       1      1
islamicawakening        96111           13070            13070     13070       1      1
scambaits               60977           13613            13613     13613       1      1
exposeascam             38322            5656             5656      5656       1      1

TABLE VII
POST EXTRACTION RESULTS (WITH TED FILTERING)

Forum             Total Nodes  Positive Nodes  Extracted Nodes  TP Nodes  Recall  Precision
eda                     51081            8947             8259      8259   0.923      1
scam                    77383           15442            14674     14674   0.950      1
ubuntu                  66785            7842             7171      7171   0.914      1
stormfront              72894           12854             6616      6536   0.508      0.988
creditcardforum         40038            8938             8691      8691   0.972      1
scamfound               35939            3306             2731      2633   0.796      0.964
realscam               116965           34431            21953     21953   0.638      1
islamicawakening        96489           13070            10034     10034   0.768      1
scambaits               55743           12042             8857      8657   0.719      0.977
exposeascam             38339            5656             5656      5656       1      1

We observe that all the thread records are extracted successfully, resulting in a 100% recall rate. However, for some forums, additional nodes (i.e. redundant or noisy information) are extracted from some pages, giving an average thread extraction precision of 0.9866. The reason is that one or more user-defined table headers are present within the content of the thread records, which leads to such content being identified as relevant table headers too; nodes linked to both the correct and the mistakenly identified headers are therefore extracted. Currently, the header comparison in the algorithm relies on sub-string matching, which allows the algorithm to support a more lenient and generic form of web forum extraction. To prevent the extraction of these additional redundant nodes (i.e. to increase the precision rate), strict string matching can be applied instead.
We also observe that for most web forums, we obtain 100% recall and precision in our post extraction results using the core post extraction algorithm (i.e. without TED filtering). However, for the scamfound forum, we notice a few list-of-posts pages with unusually short text contents.


For these pages, this leads to the selection of the wrong candidate set for extraction. The average recall and precision rates for our core post extraction algorithm (without TED filtering) are 0.9974 and 0.9979, respectively.
To resolve the issue of low average text length in the post records, we attempt to exploit the differences in DOM structure by applying TED filtering in our second post extraction algorithm. However, we observe that, despite incurring the additional overhead of computing the TED, the algorithm with TED filtering obtains poorer results than our core post extraction algorithm. The average recall and precision rates for our post extraction algorithm (with TED filtering) are 0.8188 and 0.9929, respectively. While the precision rate is comparable to that of the core post extraction algorithm, the additional TED filtering step may be too restrictive, dropping correct candidates due to the large intra-TED among the post records. Currently, the TED computation operations are given equal weights. However, for the subtree of a post record, it can be observed that there are a few common structures, corresponding to data such as the user name, join date and post date, at the beginning of the tree when traversing in a pre-order manner, whereas the tree structure corresponding to the user-posted content usually varies. Therefore, as future work, we suggest an automated adaptive weight assignment for the TED computation operations according to the hierarchical level of the nodes.
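For reference, a pairwise TED of this kind can be computed with the third-party zss package, which implements the Zhang-Shasha algorithm [27]; the conversion from lxml-style elements to zss nodes below is an illustrative assumption, not the paper's implementation:

```python
from zss import Node, simple_distance   # pip install zss

def to_zss(el):
    """Label each node by its tag and recurse over the children."""
    node = Node(el.tag)
    for child in el:
        node.addkid(to_zss(child))
    return node

def ted_filter(candidates, threshold):
    """Keep candidates whose TED to the first candidate is within threshold."""
    ref = to_zss(candidates[0])
    return [c for c in candidates
            if simple_distance(ref, to_zss(c)) <= threshold]
```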

VI. CONCLUSIONS

In this paper, we discussed the need for accurate and efficient information extraction from forum sites. The limitations of existing methods, such as the requirements for sample collection, manual sample labelling and model training, amount to a substantial human effort and can be extremely time-consuming. Another limitation is the need to generate models specifically for different forum sites and intra-site templates, which results in scalability issues. The use of computationally intensive algorithms and techniques, such as the rendering of web pages during the operational mode, also incurs severe overhead. Therefore, we proposed three generalized and lightweight algorithms for the extraction of information from forum sites. Our contributions include the definition and classification of the related HTML 4.01 tags through an exhaustive study of the specifications. Based on our observations of various forum sites, we then proposed three generally applicable algorithms to support the extraction of threads and posts from forum sites. We evaluated our algorithms against two strict criteria, measured at the granularity of node-level correctness: a thread or post is considered successfully extracted (and thereby counted as a true positive) only if (i) all the contents in its text and anchor nodes are extracted correctly, and (ii) each content node is grouped correctly according to its respective thread or post. Our experimental results showed that the proposed algorithms achieve high efficiency in extracting relevant information while avoiding noise and redundant content. Our experiments on ten forum sites show that our proposed thread extraction algorithm achieves an average recall and precision of 100% and 98.66%, respectively, while our core post extraction algorithm achieves an average recall and precision of 99.74% and 99.79%, respectively.

REFERENCES

[1] J. Zhang, M. S. Ackerman, and L. Adamic, "Expertise networks in online communities: structure and algorithms," WWW Conference, pp. 221–230, 2007.
[2] N. Glance, M. Hurst, K. Nigam, M. Siegler, R. Stockton, and T. Tomokiyo, "Deriving marketing intelligence from online discussion," ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 419–428, 2005.
[3] Y. Zhang, S. Zeng, L. Fan, Y. Dang, C. A. Larson, and H. Chen, "Dark web forums portal: searching and analyzing jihadist forums," IEEE International Conference on Intelligence and Security Informatics, pp. 71–76, 2009.
[4] Y. Zhou, J. Qin, G. Lai, and H. Chen, "Collection of U.S. extremist online forums: A web mining approach," Annual Hawaii International Conference on System Sciences, p. 70, 2007.
[5] S. Brin and L. Page, "The anatomy of a large-scale hypertextual web search engine," Computer Networks and ISDN Systems, vol. 30, no. 1-7, pp. 107–117, 1998.
[6] S. Chakrabarti, M. van den Berg, and B. Dom, "Focused crawling: a new approach to topic-specific web resource discovery," Computer Networks: The International Journal of Computer and Telecommunications Networking, vol. 31, no. 11-16, pp. 1623–1640, 1999.
[7] R. Cai, J.-M. Yang, W. Lai, Y. Wang, and L. Zhang, "iRobot: An intelligent crawler for web forums," WWW Conference, pp. 447–456, 2008.
[8] Y. Wang, J.-M. Yang, W. Lai, R. Cai, L. Zhang, and W.-Y. Ma, "Exploring traversal strategy for web forum crawling," ACM SIGIR International Conference on Research and Development in Information Retrieval, pp. 459–466, 2008.
[9] H.-M. Ying and V. L. L. Thing, "An enhanced intelligent forum crawler," IEEE Symposium on Computational Intelligence for Security and Defence Applications, 2012.
[10] A. Sachan, W. Y. Lim, and V. L. L. Thing, "A generalized links and text properties based forum crawler," IEEE/WIC/ACM Web Intelligence Conference, 2012.
[11] J. Joy and M. A., "Automated path ascend forum crawling," International Journal of Engineering Research and Technology, vol. 2, no. 3, 2013.
[12] W. W. Cohen, M. Hurst, and L. S. Jensen, "A flexible learning system for wrapping tables and lists in html documents," WWW Conference, pp. 232–241, 2002.
[13] N. Kushmerick, "Wrapper induction: efficiency and expressiveness," Artificial Intelligence - Special Issue on Intelligent Internet Systems, vol. 118, no. 1-2, pp. 15–68, 2000.
[14] I. Muslea, S. Minton, and C. Knoblock, "A hierarchical approach to wrapper induction," Annual Conference on Autonomous Agents, pp. 190–197, 1997.
[15] S. Zheng, R. Song, J.-R. Wen, and D. Wu, "Joint optimization of wrapper generation and template detection," ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 894–902, 2007.
[16] W. Gatterbauer, P. Bohunsky, M. Herzog, B. Krupl, and B. Pollak, "Towards domain-independent information extraction from web tables," WWW Conference, pp. 71–80, 2007.
[17] J. Zhu, Z. Nie, J.-R. Wen, B. Zhang, and W.-Y. Ma, "Simultaneous record detection and attribute labeling in web data extraction," ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 494–503, 2006.
[18] M. Asfia, M. M. Pedram, and A. M. Rahmani, "Main content extraction from detailed web pages," International Journal of Computer Applications, vol. 4, no. 11, pp. 18–21, August 2010.
[19] J. Zhu, Z. Nie, J.-R. Wen, B. Zhang, and W.-Y. Ma, "2D conditional random fields for web information extraction," International Conference on Machine Learning, pp. 1044–1051, 2005.
[20] D. Buttler, L. Liu, and C. Pu, "A fully automated object extraction system for the world wide web," IEEE International Conference on Distributed Computing Systems, pp. 361–370, 2001.
[21] B. Liu, R. Grossman, and Y. Zhai, "Mining data records from web pages," ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 601–606, 2003.
[22] D. Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997.
[23] F. Sun, D. Song, and L. Liao, "DOM based content extraction via text density," International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 245–254, 2011.
[24] J.-M. Yang, R. Cai, Y. Wang, J. Zhu, L. Zhang, and W.-Y. Ma, "Incorporating site-level knowledge to extract structured data from web forums," WWW Conference, pp. 181–190, 2009.
[25] W. Y. Lim, A. Sachan, and V. L. L. Thing, "A lightweight algorithm for automated forum information processing," IEEE/WIC/ACM Web Intelligence Conference, 2013.
[26] W3Schools, "HTML 4.01 / XHTML 1.0 reference," June 2012.
[27] K. Zhang and D. Shasha, "Simple fast algorithms for the editing distance between trees and related problems," SIAM Journal on Computing, vol. 18, no. 6, pp. 1245–1262, 1989.
